Below is a step-by-step plan that combines two concepts: how to create uniform-length sequences (via padding or DTW warping) and how to segment your data (whole-workout sliding windows versus per-trial segmentation). The plan uses sliding windows as the long-term injury risk predictor, capturing cumulative fatigue and biomechanical changes across workouts, and uses per-trial segmentation for more immediate, detailed snapshots of injury risk.

---

## 1. Overview

- **Sliding Window (Whole-Workout) Segmentation:**  
  - **Goal:** Capture long-term trends (e.g., accumulating fatigue and gradual increases in valgus torque) by extracting overlapping fixed-length windows from continuous workout data.  
  - **Strength:** Ideal for predicting overall performance and injury risk over time.
  
- **Per-Trial Segmentation:**  
  - **Goal:** Isolate individual motion cycles (e.g., a single pitch) to capture fine-grained technical and biomechanical details.  
  - **Strength:** Better suited for immediate injury risk prediction.

- **Uniform Sequence Length (Padding vs. DTW Warping):**  
  - **Padding:** Simple method that appends zeros (or another constant) to shorter sequences to match the longest sequence.  
  - **DTW Warping:** A more nuanced technique that non-linearly aligns each sequence to a reference (typically the longest) by stretching or compressing time steps, thereby preserving key temporal dynamics.

---

## 2. Detailed Step-by-Step Pipeline

### **Step 1: Data Collection**
- **Collect Continuous Data:**  
  Gather sensor data (e.g., motion capture, IMU data, video features) during entire workouts.
- **Record Metadata:**  
  Log metadata such as shooting phase, workout type, and baseline fitness level (categorical fields are encoded later, in Step 2).

### **Step 2: Data Preprocessing**
- **Clean and Filter Data:**  
  Remove noise and artifacts from sensor readings.
- **Normalize Data:**  
  Normalize numeric features so they’re on comparable scales.
- **Encode Categorical Variables:**  
  One‑hot encode shooting phases and workout types.
- **Time Alignment:**  
  Ensure all data points are synchronized in time.

### **Step 3: Sequence Segmentation**
- **Whole-Workout Segmentation (Sliding Windows):**  
  - **Define Window Size & Step Size:**  
    Decide on a fixed length (e.g., 5 minutes or a set number of time steps) and a step size (e.g., sliding by 1 minute, which leaves 4 minutes of overlap between consecutive 5-minute windows).
  - **Extract Windows:**  
    Slide the window over the continuous time series to generate multiple overlapping sub-sequences. Each window carries both dynamic sensor data and static/categorical features.
  
- **Per-Trial Segmentation (Optional):**  
  - **Extract Trials:**  
    Identify and segment individual motion cycles (e.g., single pitches) to analyze immediate biomechanical dynamics.
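
The window extraction described above can be sketched as follows. This is a minimal illustration, assuming the workout is already a NumPy array of shape `(time_steps, features)`; the function name `sliding_windows` and the toy data are hypothetical and are not part of the pipeline's actual code.

```python
import numpy as np

def sliding_windows(series: np.ndarray, window_size: int, step_size: int) -> np.ndarray:
    """Return overlapping windows with shape (n_windows, window_size, n_features)."""
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    n_steps = series.shape[0]
    if n_steps < window_size:
        # Not enough data for even one full window.
        return np.empty((0, window_size, series.shape[1]), dtype=series.dtype)
    starts = range(0, n_steps - window_size + 1, step_size)
    return np.stack([series[s:s + window_size] for s in starts])

# Toy workout: 10 time steps, 2 sensor channels.
workout = np.arange(20, dtype=float).reshape(10, 2)
windows = sliding_windows(workout, window_size=4, step_size=2)
print(windows.shape)  # (4, 4, 2)
```

A step size smaller than the window size produces overlapping windows; trailing samples that do not fill a complete window are dropped in this sketch.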

### **Step 3A: Uniform Sequence Length Normalization**
- **Determine Target Length:**  
  Use the length of the longest sequence (or a predefined maximum) as the reference.
  
- **Option A: Padding**  
  - For any sequence shorter than the target, append zeros (or a constant value) at the end until the length matches the target.
  - **Best Use Case:** When sequence length variation is minor or when “empty” padding doesn’t harm model performance.
  
- **Option B: DTW Warping**  
  - Compute a Dynamic Time Warping (DTW) path between each sequence and the reference sequence.
  - **Warp the Sequence:**  
    Stretch or compress the time steps so that each sequence exactly matches the target length.
  - **Best Use Case:** When preserving and aligning the key temporal dynamics is critical.
  
> **Note:** These two methods are mutually exclusive. Your pipeline should use either padding (with `use_dtw = False`) or DTW warping (with `use_dtw = True`).
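
To make the two options concrete, here is a small sketch that switches between them with a `use_dtw` flag. It assumes sequences are arrays of shape `(time_steps, features)`. The helper names (`pad_sequence`, `dtw_path`, `dtw_warp`) are hypothetical, and averaging all frames that align to the same reference step is only one possible warping rule, not necessarily the one your pipeline uses.

```python
import numpy as np

def pad_sequence(seq, target_len, padding_value=0.0):
    """Append constant rows until the sequence has target_len time steps."""
    seq = np.asarray(seq, dtype=float)
    if len(seq) > target_len:
        raise ValueError("sequence is longer than target_len")
    pad = np.full((target_len - len(seq), seq.shape[1]), padding_value)
    return np.vstack([seq, pad])

def dtw_path(a, b):
    """Classic O(len(a) * len(b)) DTW; returns the optimal alignment path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the last cell to the first to recover the path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda ij: cost[ij])
        path.append((i - 1, j - 1))
    return path[::-1]

def dtw_warp(seq, reference):
    """Resample seq onto the reference time axis by averaging aligned frames."""
    seq = np.asarray(seq, dtype=float)
    reference = np.asarray(reference, dtype=float)
    warped = np.zeros((len(reference), seq.shape[1]))
    counts = np.zeros(len(reference))
    for i, j in dtw_path(seq, reference):
        warped[j] += seq[i]
        counts[j] += 1
    return warped / counts[:, None]

use_dtw = True
reference = np.sin(np.linspace(0, np.pi, 8))[:, None]  # longest sequence
short = np.sin(np.linspace(0, np.pi, 5))[:, None]      # shorter sequence
uniform = dtw_warp(short, reference) if use_dtw else pad_sequence(short, len(reference))
print(uniform.shape)  # (8, 1)
```

Padding leaves the original samples untouched and adds constant rows at the end, whereas warping changes the time axis of every sample, which is why the two are not combined.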

### **Step 4: Feature Engineering**
- **Combine Features:**  
  Create a feature vector for each window (or trial) that includes:
  - Time-series sensor data.
  - One‑hot encoded shooting phases (dynamic within the window).
  - Static features like workout type and initial fitness level (either replicated across time steps or appended as global features).
- **Sequence Labeling:**  
  Label each sequence with the target output (e.g., predicted work capacity, power output, or risk level based on valgus torque).
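
As an illustration of the "replicated across time steps" option, the sketch below tiles a static feature vector onto every row of a window. The function name and the example values (a two-column one-hot workout type plus a fitness score) are hypothetical.

```python
import numpy as np

def attach_static_features(window, static):
    """Tile a static feature vector across every time step of a window."""
    window = np.asarray(window, dtype=float)
    static = np.asarray(static, dtype=float)
    tiled = np.tile(static, (window.shape[0], 1))
    return np.concatenate([window, tiled], axis=1)

rng = np.random.default_rng(0)
window = rng.normal(size=(50, 6))        # 50 time steps, 6 sensor channels
static = np.array([1.0, 0.0, 0.72])      # hypothetical: one-hot workout type + fitness score
features = attach_static_features(window, static)
print(features.shape)  # (50, 9)
```

The alternative mentioned above, appending static features once as global inputs, keeps the sequence narrower but requires a model with a second input branch.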

### **Step 5: Model Architecture Selection**
- **Select a Recurrent Model:**  
  An LSTM or GRU is well-suited for time-series forecasting.
- **Input Structure:**  
  Design the model to accept input shaped as (window_length, number_of_features).  
- **Consider Hybrid Models:**  
  Optionally incorporate convolutional layers to capture local patterns along with recurrent layers for longer-term dependencies.

### **Step 6: Model Training**
- **Data Splitting:**  
  Divide data into training, validation, and test sets in a way that respects temporal order (e.g., leaving whole workouts out for testing).
- **Loss Function & Metrics:**  
  Use appropriate loss functions (like mean squared error for continuous outputs) and evaluation metrics.
- **Regularization:**  
  Apply dropout or other regularization techniques to mitigate overfitting, especially with overlapping windows.
- **Training Strategy:**  
  Train the model on your sequences (windows or trials), monitoring performance and tuning hyperparameters accordingly.
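
Because overlapping windows from the same workout share samples, splitting windows at random would leak test data into training. The following sketch holds out whole workouts instead. It assumes that workout IDs sort chronologically, so the most recent workouts become the test set; the function name is hypothetical.

```python
import numpy as np

def split_by_workout(workout_ids, test_fraction=0.2):
    """Hold out the most recent workouts; return (train_idx, test_idx) over windows."""
    workout_ids = np.asarray(workout_ids)
    unique_ids = np.unique(workout_ids)  # sorted ascending (assumed chronological)
    n_test = max(1, int(round(test_fraction * len(unique_ids))))
    test_ids = unique_ids[-n_test:]
    test_mask = np.isin(workout_ids, test_ids)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Five workouts, three windows each.
workout_ids = np.repeat([1, 2, 3, 4, 5], 3)
train_idx, test_idx = split_by_workout(workout_ids, test_fraction=0.2)
print(test_idx)  # [12 13 14]
```

No workout contributes windows to both sides of the split, so overlap between neighbouring windows cannot inflate test performance.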

### **Step 7: Model Evaluation**
- **Evaluate Performance:**  
  Test on whole-workout sequences or withheld workouts to assess overall performance trends.
- **Compare Segmentation Approaches:**  
  Optionally, compare results between per-trial and whole-workout segmentation:
  - **Sliding Window (Whole-Workout):** Better for long-term, cumulative injury risk prediction.
  - **Per-Trial:** Provides more detailed, immediate injury risk insights.
- **Interpret Predictions:**  
  Analyze model outputs in the context of fatigue trends and biomechanical shifts, ensuring that both immediate and cumulative injury risks are captured.

### **Step 8: Deployment and Ongoing Prediction**
- **Real-Time Implementation:**  
  Continuously feed new sensor data into a sliding window buffer for ongoing predictions.
- **Dynamic Feature Updates:**  
  Allow for periodic updates to static features like fitness level to keep the model current.
- **Feedback Loop:**  
  Use predictions to inform training adjustments or real-time interventions aimed at injury prevention.
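
A sliding window buffer for streaming data could look roughly like this. It is a sketch under the assumption that samples arrive one at a time and that a prediction is wanted every `step_size` samples once the buffer is full; the class name `SlidingWindowBuffer` is hypothetical.

```python
from collections import deque
import numpy as np

class SlidingWindowBuffer:
    """Keep the latest window_size samples and emit a window every step_size samples."""

    def __init__(self, window_size, step_size):
        self.window_size = window_size
        self.step_size = step_size
        self._buffer = deque(maxlen=window_size)  # oldest samples fall off automatically
        self._since_last = 0

    def push(self, sample):
        """Add one sample; return a full window when one is due, otherwise None."""
        self._buffer.append(np.asarray(sample, dtype=float))
        self._since_last += 1
        if len(self._buffer) == self.window_size and self._since_last >= self.step_size:
            self._since_last = 0
            return np.stack(list(self._buffer))
        return None

buffer = SlidingWindowBuffer(window_size=4, step_size=2)
emitted = []
for k in range(10):
    window = buffer.push([float(k), 2.0 * k])
    if window is not None:
        emitted.append(window)  # in production, pass the window to the model here
print(len(emitted))  # 4
```

With these settings the buffer emits windows that start at samples 0, 2, 4, and 6.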

---

## 3. Injury Risk Prediction Strategy

- **Long-Term (Cumulative) Injury Risk:**  
  - **Method:** Whole-workout segmentation with a sliding window.
  - **Advantage:** Captures progressive fatigue and cumulative biomechanical changes (e.g., increasing valgus torque) that suggest an elevated long-term injury risk.

- **Immediate (Acute) Injury Risk:**  
  - **Method:** Per-trial segmentation.
  - **Advantage:** Provides high-resolution insight into individual motion cycles, allowing for the detection of sudden deviations or technical errors that may indicate imminent injury.

---

## Final Summary

1. **Data Pipeline:**
   - Collect continuous sensor data and categorical metadata.
   - Preprocess data with cleaning, normalization, encoding, and time alignment.

2. **Segmentation:**
   - Use a sliding window approach for long-term monitoring.
   - Optionally segment individual trials for immediate risk detection.

3. **Uniform Sequence Length:**
   - **Padding:** Simple zero-padding for minor length variations.
   - **DTW Warping:** Non-linear alignment to preserve temporal dynamics.
   - Choose one method (controlled via a configuration flag).

4. **Feature Engineering & Modeling:**
   - Combine dynamic sensor data with static features.
   - Use LSTM/GRU or hybrid models for capturing both short-term details and long-term trends.

5. **Training, Evaluation, and Deployment:**
   - Split data respecting time sequences.
   - Train with appropriate loss functions and regularization.
   - Evaluate both segmentation methods to determine which best predicts overall injury risk (sliding window for cumulative trends, per-trial for immediate risk).
   - Deploy for real-time monitoring and update features dynamically.

By integrating these strategies, you create a robust system where the sliding window approach drives long-term injury risk predictions, and per-trial segmentation offers a finer, immediate risk assessment. This dual strategy enables timely interventions and a comprehensive view of athlete performance and injury risk.





Below is an integrated, comprehensive vision and step‑by‑step guide for time series preprocessing tailored to injury risk prediction in sports analytics. This guide combines our existing dual segmentation and uniform sequence handling concepts with advanced techniques from recent literature and industry best practices. The goal is to create a robust, flexible, and modular pipeline that supports both regression and classification tasks while addressing the unique challenges of sports biomechanics data.

---

## Our Integrated Vision

Our pipeline will seamlessly switch between:
  
1. **Whole‑Workout Segmentation (Sliding Window Approach):**  
   - **Purpose:** Capture cumulative, long‑term trends (e.g., progressive fatigue or increasing valgus torque) over an entire workout.  
   - **Method:** Extract overlapping, fixed‑length windows from continuous sensor data, ensuring both local dynamics and global progression are captured.

2. **Per‑Trial Segmentation (Using Sequence Categorical):**  
   - **Purpose:** Isolate individual motion cycles or trials to capture acute, fine‑grained biomechanical details that may signal immediate injury risk.  
   - **Method:** Group data based on a trial identifier (via a parameter such as `sequence_categorical`) and process each group as an independent sequence.

In both cases, we guarantee that every sequence is uniform in length using either:
  
- **Padding:** Append zeros (or another constant) to shorter sequences based on the longest sequence in the set.  
- **Dynamic Time Warping (DTW):** Optionally align sequences non‑linearly to a reference length to preserve temporal dynamics. DTW may improve accuracy, but unconstrained DTW costs time quadratic in the sequence length.  
- **Hybrid or Learned Alignment Alternatives:** Explore attention‑based alignment layers or transformer-based models that natively handle variable-length inputs for future iterations.

Beyond segmentation and sequence standardization, our vision incorporates advanced preprocessing techniques to further enhance our model’s performance:

- **Sensor Fusion & Metadata Integration:** Combine multimodal sensor data (IMU, motion capture, video) with contextual metadata (e.g., workout type, baseline fitness, trial identifiers).  
- **Adaptive and Domain-Specific Preprocessing:**  
  - **Noise Filtering and Imputation:** Use standard methods (SimpleImputer/KNNImputer) but consider domain-specific methods like wavelet‑based denoising or biomechanically informed imputation for features such as arm rotation angles.  
  - **Outlier Detection:** Apply Isolation Forest or custom contamination thresholds tuned for sports biomechanics data (e.g., filtering biomechanically implausible joint angles).  
  - **Normalization Strategies:** In addition to StandardScaler or MinMaxScaler, consider transformations such as the Yeo‑Johnson PowerTransformer to better handle non‑stationary and skewed data.
  
- **Temporal Decomposition & Adaptive Resampling:**  
  - Use multiple seasonal-trend decomposition using LOESS (MSTL) or adaptive differencing to capture trends and seasonality in sensor signals.  
  - Consider event‑driven or phase‑adaptive resampling techniques (e.g., detecting critical movement phases) to dynamically adjust window sizes for enhanced personalization.

- **Feature Engineering Enhancements:**  
  - Extract dynamic biomechanical features (angular velocity, angular acceleration, jerk, ground reaction force asymmetry, spectral entropy) that capture movement quality and smoothness.  
  - Explore attention‑based fusion for multimodal data, combining sensor streams and static metadata to enrich the feature space.

- **Model Architecture Considerations:**  
  - While our pipeline initially supports LSTMs/GRUs, we will also explore alternatives such as Temporal Convolutional Networks (TCNs) and transformer architectures.  
  - Incorporate time‑aware SMOTE variants (e.g., TS‑SMOTE or ShapeDTW‑SMOTE) to address class imbalance in temporal data, preserving the temporal structure of synthetic samples.

- **Deployment Optimizations:**  
  - Implement real‑time processing (e.g., incremental DTW, sensor fusion via Kalman filters) and consider edge computing optimizations such as model quantization to enable low‑latency, mobile deployments.

---

## Detailed Step‑by‑Step Pipeline

### **Step 1: Data Collection and Input Configuration**
- **Data Acquisition:**  
  - Collect continuous sensor data (motion capture, IMU, video) throughout the workout.
  - Record relevant metadata: workout type, baseline fitness, and a trial identifier if available.
- **Configuration:**  
  - Use YAML configuration files to specify parameters such as:
    - Data paths, sensor column names, and categorical column names.
    - Time series parameters: `time_column`, `window_size`, `horizon`, `step_size`, and `max_sequence_length`.
    - New parameter: `sequence_categorical` (to specify trial segmentation when applicable).
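
One way to validate these parameters after loading them is a small dataclass. In the sketch below the dictionary `raw` stands in for the output of a YAML loader such as PyYAML's `yaml.safe_load`; the class name `TimeSeriesConfig`, the column names, and the validation rules are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TimeSeriesConfig:
    time_column: str
    window_size: int
    horizon: int
    step_size: int
    max_sequence_length: Optional[int] = None
    sequence_categorical: Optional[List[str]] = None
    use_dtw: bool = False

    def __post_init__(self):
        # Fail early on values that would silently produce empty or invalid windows.
        for name in ("window_size", "horizon", "step_size"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be a positive integer")

    @property
    def segmentation(self) -> str:
        """Per-trial grouping when a trial identifier is configured, else sliding window."""
        return "per_trial" if self.sequence_categorical else "sliding_window"

# Stand-in for a parsed YAML file (hypothetical column names).
raw = {
    "time_column": "timestamp",
    "window_size": 100,
    "horizon": 10,
    "step_size": 25,
    "sequence_categorical": ["trial_id"],
}
config = TimeSeriesConfig(**raw)
print(config.segmentation)  # per_trial
```

Validating at load time keeps configuration mistakes from surfacing later as shape errors deep inside the segmentation code.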

### **Step 2: Data Preprocessing**
- **Cleaning and Filtering:**  
  - Remove noise and artifacts from raw sensor data.
  - Consider advanced denoising (e.g., wavelet-based methods) for preserving transient biomechanical features.
- **Missing Value Imputation:**  
  - Use SimpleImputer or KNNImputer for low-missingness data.
  - For more complex temporal patterns, consider iterative imputation (e.g., IterativeImputer with BayesianRidge) or even autoencoder-based imputation.
- **Outlier Detection:**  
  - Apply methods like Isolation Forest with sport-specific contamination thresholds.
- **Normalization and Scaling:**  
  - Normalize numerical features using StandardScaler or MinMaxScaler.
  - Optionally use PowerTransformer (Yeo‑Johnson) to better handle skewed data.
- **Categorical Encoding:**  
  - One‑hot or ordinal encode categorical variables such as workout type and shooting phases.
- **Time Alignment:**  
  - Synchronize sensor data using the specified `time_column`.
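
As a simple, domain-informed complement to the imputers and outlier detectors listed above, the sketch below flags joint-angle samples that are missing or fall outside a plausible range and fills them by linear interpolation. The bounds `lo` and `hi` are placeholder values chosen for the example, not validated biomechanical limits, and the function name is hypothetical.

```python
import numpy as np

def clean_joint_angle(angle_deg, t, lo=-10.0, hi=160.0):
    """Mask missing or implausible samples, then linearly interpolate over them."""
    angle = np.asarray(angle_deg, dtype=float).copy()
    t = np.asarray(t, dtype=float)
    bad = np.isnan(angle) | (angle < lo) | (angle > hi)
    if bad.all():
        raise ValueError("no valid samples to interpolate from")
    angle[bad] = np.interp(t[bad], t[~bad], angle[~bad])
    return angle, bad

t = np.arange(6, dtype=float)  # sample index used as the time axis
elbow = np.array([30.0, 32.0, np.nan, 36.0, 400.0, 40.0])
cleaned, flagged = clean_joint_angle(elbow, t)
print(cleaned)  # [30. 32. 34. 36. 38. 40.]
```

Returning the `flagged` mask lets later steps track how much of each sequence was imputed rather than measured.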

### **Step 3: Sequence Segmentation**
- **Decide on Segmentation Strategy:**
  - **Per‑Trial Segmentation:**  
    - If `sequence_categorical` (e.g., “trial”) is provided, group data by this identifier.
    - Call a dedicated method (e.g., `create_sequences_by_category`) to extract and handle sequences from each trial.
  - **Whole‑Workout Segmentation (Sliding Window):**  
    - If no grouping is specified, apply a fixed‑length sliding window over the entire workout to generate overlapping sub‑sequences.
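
The per-trial branch can be sketched with plain NumPy. This is not the implementation of `create_sequences_by_category`; it is a simplified stand-in with a hypothetical name (`group_by_trial`) that shows the grouping idea for a single trial identifier column.

```python
import numpy as np

def group_by_trial(X, y, trial_ids):
    """Split rows into one (X, y) sequence per trial, keeping first-appearance order."""
    X, y, trial_ids = np.asarray(X), np.asarray(y), np.asarray(trial_ids)
    # np.unique sorts its output, so recover the order in which trials first appear.
    _, first_idx = np.unique(trial_ids, return_index=True)
    ordered_ids = trial_ids[np.sort(first_idx)]
    sequences = {}
    for tid in ordered_ids:
        mask = trial_ids == tid
        sequences[tid.item()] = (X[mask], y[mask])
    return sequences

X = np.arange(12.0).reshape(6, 2)
y = np.array([0, 0, 1, 1, 1, 1])
trial_ids = np.array([7, 7, 3, 3, 3, 3])
sequences = group_by_trial(X, y, trial_ids)
print({k: v[0].shape for k, v in sequences.items()})  # {7: (2, 2), 3: (4, 2)}
```

The resulting sequences have different lengths, which is why the uniform-length step that follows is required before batching.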

### **Step 4: Uniform Sequence Length Handling**
- **Target Length Determination:**  
  - Determine the maximum sequence length among the extracted groups (or sliding windows).
- **Uniformity Methods:**  
  - **Padding:**  
    - Append zeros (or a constant value) to sequences shorter than the maximum length.
  - **DTW Warping:**  
    - If enabled (via `use_dtw`), compute a DTW warping path for each sequence and non‑linearly align it to the target length.
  - **Alternative Approaches:**  
    - Investigate constrained DTW (e.g., with Sakoe-Chiba banding) or learned alignment using attention layers for future iterations.
- **Configuration:**  
  - Control the process with parameters such as `use_dtw`, `padding_value`, and optional `dtw_params`.
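
To show what the Sakoe-Chiba constraint changes, the sketch below computes a DTW distance for one-dimensional signals with an optional band that limits how far the alignment may drift from the diagonal. The function name and the `band` parameter are hypothetical and are not tied to any particular DTW library.

```python
import numpy as np

def dtw_distance(a, b, band=None):
    """DTW distance between 1-D sequences; band limits |i - j| (Sakoe-Chiba)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    if band is not None:
        # The band must cover the length difference or the end cell is unreachable.
        band = max(band, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        j_lo = 1 if band is None else max(1, i - band)
        j_hi = m if band is None else min(m, i + band)
        for j in range(j_lo, j_hi + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

a = [0, 0, 0, 0, 1, 0]  # peak late in the sequence
b = [0, 1, 0, 0, 0, 0]  # same peak, three steps earlier
print(dtw_distance(a, b))          # 0.0
print(dtw_distance(a, b, band=1))  # 2.0
```

Without the band, DTW aligns the two peaks even though they are three steps apart; with `band=1` that alignment is forbidden, so the distance rises. A narrow band reduces computation and prevents physically implausible warps, at the cost of missing larger timing shifts.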

### **Step 5: Feature Engineering**
- **Dynamic Feature Extraction:**  
  - Combine sensor time series data with static metadata.
  - Derive additional biomechanical features such as:
    - Angular velocity (dθ/dt), angular acceleration (d²θ/dt²), and jerk (d³θ/dt³).
    - Ground reaction force asymmetry.
    - Movement smoothness metrics like spectral entropy.
- **Multimodal Fusion:**  
  - Explore advanced fusion techniques (e.g., cross‑modal attention) to integrate sensor streams with metadata.
- **Sequence Labeling:**  
  - Label each sequence with a target output (e.g., predicted work capacity or injury risk indicator).
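
Two of the features listed above can be sketched with NumPy alone: kinematic derivatives of a joint angle and a normalized spectral entropy as a rough smoothness proxy. The function names and the particular entropy definition (Shannon entropy of the normalized power spectrum, excluding the DC bin) are assumptions, and other definitions are in use.

```python
import numpy as np

def kinematic_derivatives(theta, dt):
    """Finite-difference angular velocity, acceleration, and jerk."""
    omega = np.gradient(theta, dt)   # d(theta)/dt
    alpha = np.gradient(omega, dt)   # d^2(theta)/dt^2
    jerk = np.gradient(alpha, dt)    # d^3(theta)/dt^3
    return omega, alpha, jerk

def spectral_entropy(signal):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    signal = np.asarray(signal, dtype=float)
    power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    power = power[1:]  # drop the DC bin
    total = power.sum()
    if total == 0:
        return 0.0
    p = power / total
    p = p[p > 0]  # avoid log(0)
    return float(-(p * np.log(p)).sum() / np.log(len(power)))

dt = 0.01
t = np.arange(200) * dt
theta = np.sin(2 * np.pi * t)                      # smooth 1 Hz motion
noise = np.random.default_rng(0).normal(size=200)  # erratic signal
omega, alpha, jerk = kinematic_derivatives(theta, dt)
print(spectral_entropy(theta) < spectral_entropy(noise))  # True
```

Repeated finite differencing amplifies sensor noise, so in practice the angle signal is usually filtered before jerk is computed.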

### **Step 6: Model Architecture Selection**
- **Baseline Models:**  
  - Utilize LSTM/GRU networks for capturing temporal dependencies.
- **Alternative Architectures:**  
  - Evaluate Temporal Convolutional Networks (TCNs) for parallelizable training and larger receptive fields.
  - Consider transformer-based models with self‑attention for natively handling variable‑length sequences.
- **Parameterization:**  
  - Define input shape `(sequence_length, num_features)` and adjust hyperparameters (layers, hidden units, dropout) via configuration.

### **Step 7: Training and Evaluation**
- **Data Splitting:**  
  - Split data into training, validation, and test sets, ensuring temporal order is preserved (e.g., leave out entire workouts or trials).
- **Loss Functions and Metrics:**  
  - For regression, use metrics like RMSE; for classification, use accuracy, F1‑score, and AUC.
- **Regularization:**  
  - Use dropout and early stopping to prevent overfitting.
- **Comparative Evaluation:**  
  - Compare model performance using both segmentation methods:
    - **Sliding Window:** Suited for long‑term, cumulative risk prediction.
    - **Per‑Trial:** Provides high-resolution detection of immediate risk.
- **Class Imbalance:**  
  - Consider time-aware SMOTE variants (e.g., TS‑SMOTE, ShapeDTW‑SMOTE) to generate synthetic samples while preserving temporal structure.
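
The metrics above are available in scikit-learn; the NumPy versions below are only meant to show what they compute and why accuracy alone can mislead when injuries are rare. The function names and the toy labels are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error for regression targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def binary_f1(y_true, y_pred):
    """F1-score for the positive class; returns 0.0 when there are no positives at all."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Imbalanced toy labels: 2 high-risk sequences out of 10.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
always_safe = np.zeros(10, dtype=int)  # a model that never predicts risk
accuracy = float(np.mean(y_true == always_safe))
print(accuracy)                        # 0.8
print(binary_f1(y_true, always_safe))  # 0.0
```

A model that never flags risk scores 80% accuracy here but an F1 of zero, which is the situation the class-imbalance techniques are meant to address.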

### **Step 8: Deployment and Ongoing Prediction**
- **Real‑Time Data Processing:**  
  - Continuously update a sliding window buffer with new sensor data for real‑time predictions.
- **Dynamic Feature Updates:**  
  - Periodically refresh static features (e.g., updated fitness levels) to adapt the model.
- **Deployment Optimizations:**  
  - For edge deployment, apply model quantization (e.g., converting FP32 to INT8) and optimize streaming DTW or incremental alignment.
- **Feedback Loop:**  
  - Use model predictions to trigger training adjustments, load management, or real‑time athlete feedback.
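
The FP32-to-INT8 conversion mentioned above can be illustrated with a symmetric per-tensor scheme. This sketch only demonstrates the arithmetic; it is not the API of any deployment framework, and real deployments would normally rely on the framework's own quantization tooling. The function names are hypothetical.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: weights ~= scale * q, with q in [-127, 127]."""
    weights = np.asarray(weights, dtype=np.float32)
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate FP32 values."""
    return q.astype(np.float32) * scale

weights = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scale = quantize_int8(weights)
max_error = float(np.max(np.abs(weights - dequantize(q, scale))))
print(q.dtype, max_error <= scale / 2 + 1e-6)  # int8 True
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale.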

---

## Summary

1. **Data Collection:**  
   - Gather multimodal sensor data and rich metadata (including trial identifiers) with proper configuration.
  
2. **Preprocessing:**  
   - Clean, impute, detect outliers, normalize, and encode data while synchronizing time series using the `time_column`.

3. **Segmentation:**  
   - Choose between per‑trial segmentation (using `sequence_categorical`) and sliding window segmentation, based on the prediction horizon (immediate vs. cumulative injury risk).

4. **Uniform Sequence Handling:**  
   - Standardize sequence lengths by padding or DTW warping (or hybrid alternatives) using the longest sequence as the target.

5. **Feature Engineering:**  
   - Combine sensor data with static metadata and derive biomechanically informed features to enrich the input.

6. **Model Architecture:**  
   - Implement time-series models (LSTM/GRU, TCN, or Transformers) with parameters driven by configuration files.

7. **Training & Evaluation:**  
   - Split data by whole workouts or trials, use task-appropriate metrics (RMSE for regression; accuracy, F1‑score, and AUC for classification), address class imbalance with time-aware SMOTE variants, and compare segmentation strategies.

8. **Deployment:**  
   - Optimize for real‑time processing with model quantization and streaming alignment, and set up dynamic feedback loops for injury prevention.

This integrated approach builds on standard foundational methods and adds domain‑specific techniques for preprocessing time series data in injury risk prediction. It is designed to keep the pipeline flexible and adaptive, so that it can handle varying sensor modalities and evolving sports analytics requirements.


In [None]:
# datapreprocessor.py

import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple, Any
import logging
import os
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from imblearn.over_sampling import SMOTE, ADASYN, SMOTENC, SMOTEN, BorderlineSMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import probplot
import joblib  # For saving/loading transformers
from inspect import signature  # For parameter validation in SMOTE

class DataPreprocessor:
    def __init__(
        self,
        model_type: str,
        y_variable: List[str],
        ordinal_categoricals: List[str],
        nominal_categoricals: List[str],
        numericals: List[str],
        mode: str,  # 'train', 'predict', 'clustering'
        options: Optional[Dict] = None,
        debug: bool = False,
        normalize_debug: bool = False,
        normalize_graphs_output: bool = False,
        graphs_output_dir: str = './plots',
        transformers_dir: str = './transformers',
        # New time series parameters:
        time_column: Optional[str] = None,
        window_size: Optional[int] = None,
        horizon: Optional[int] = None,
        step_size: Optional[int] = None,
        max_sequence_length: Optional[int] = None,
        use_dtw: bool = False,
        # New parameter for per-trial segmentation
        sequence_categorical: Optional[List[str]] = None,
        # New: dynamic window adjustment flag
        dynamic_window_adjustment: bool = False
    ):
        """
        Initialize the DataPreprocessor with model type and feature lists.

        Args:
            model_type (str): Type of the machine learning model (e.g., 'Logistic Regression').
            y_variable (List[str]): List of target variable(s).
            ordinal_categoricals (List[str]): List of ordinal categorical features.
            nominal_categoricals (List[str]): List of nominal categorical features.
            numericals (List[str]): List of numerical features.
            mode (str): Operational mode ('train', 'predict', 'clustering').
            options (Optional[Dict]): User-defined options for preprocessing steps.
            debug (bool): General debug flag to control overall verbosity.
            normalize_debug (bool): Flag to display normalization plots.
            normalize_graphs_output (bool): Flag to save normalization plots.
            graphs_output_dir (str): Directory to save plots.
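            transformers_dir (str): Directory to save/load transformers.
            time_column (Optional[str]): Name of the timestamp column used to order and align samples for time series models.
            window_size (Optional[int]): Number of time steps in each sliding window.
            horizon (Optional[int]): Forecast horizon, in time steps.
            step_size (Optional[int]): Stride between consecutive windows, in time steps.
            max_sequence_length (Optional[int]): Maximum sequence length used as the target when padding or warping.
            use_dtw (bool): If True, align sequences with DTW warping; otherwise pad them.
            sequence_categorical (Optional[List[str]]): Column(s) identifying a trial; when set, sequences are built per group instead of by sliding window.
            dynamic_window_adjustment (bool): If True, no global padding is applied.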
        """
        self.model_type = model_type
        self.y_variable = y_variable
        self.ordinal_categoricals = ordinal_categoricals
        self.nominal_categoricals = nominal_categoricals
        self.numericals = numericals
        self.mode = mode.lower()
        if self.mode not in ['train', 'predict', 'clustering']:
            raise ValueError("Mode must be one of 'train', 'predict', or 'clustering'.")
        self.options = options or {}
        self.debug = debug
        self.normalize_debug = normalize_debug
        self.normalize_graphs_output = normalize_graphs_output
        self.graphs_output_dir = graphs_output_dir
        self.transformers_dir = transformers_dir

        # New time series parameters
        self.time_column = time_column
        self.window_size = window_size
        self.horizon = horizon
        self.step_size = step_size
        self.max_sequence_length = max_sequence_length
        self.use_dtw = use_dtw
        # New parameter for trial segmentation
        self.sequence_categorical = sequence_categorical
        # NEW: dynamic window adjustment flag (if True, no global padding is applied)
        self.dynamic_window_adjustment = dynamic_window_adjustment

        # (… rest of __init__ remains unchanged …)
        # Initialize hierarchical categories holder (used later in temporal encoding)
        self.hierarchical_categories = {}

        # Determine model category (override if time series keywords found)
        model_type_lower = self.model_type.lower()
        if any(kw in model_type_lower for kw in ['lstm', 'rnn', 'time series']):
            self.model_category = 'time_series'
        else:
            self.model_category = self.map_model_type_to_category()

        self.categorical_indices = []

        if self.model_category == 'unknown':
            self.logger = logging.getLogger(self.__class__.__name__)
            self.logger.error(f"Model category for '{self.model_type}' is unknown. Check your configuration.")
            raise ValueError(f"Model category for '{self.model_type}' is unknown. Check your configuration.")

        # --- Updated target variable initialization logic ---
        if self.mode in ['train', 'predict']:
            if not self.y_variable:
                raise ValueError("Target variable 'y_variable' must be specified for supervised models in train/predict mode.")
        elif self.mode == 'clustering':
            self.y_variable = []
        # ----------------------------------------------------

        # Initialize other variables
        self.scaler = None
        self.transformer = None
        self.ordinal_encoder = None
        self.nominal_encoder = None
        self.preprocessor = None
        self.smote = None
        self.feature_reasons = {col: '' for col in self.ordinal_categoricals + self.nominal_categoricals + self.numericals}
        self.preprocessing_steps = []
        self.normality_results = {}
        self.features_to_transform = []
        self.nominal_encoded_feature_names = []
        self.final_feature_order = []

        # Initialize placeholders for clustering-specific transformers
        self.cluster_transformers = {}
        self.cluster_model = None
        self.cluster_labels = None
        self.silhouette_score = None

        # Define default thresholds for SMOTE recommendations
        self.imbalance_threshold = self.options.get('smote_recommendation', {}).get('imbalance_threshold', 0.1)
        self.noise_threshold = self.options.get('smote_recommendation', {}).get('noise_threshold', 0.1)
        self.overlap_threshold = self.options.get('smote_recommendation', {}).get('overlap_threshold', 0.1)
        self.boundary_threshold = self.options.get('smote_recommendation', {}).get('boundary_threshold', 0.1)

        self.pipeline = None  # Initialize pipeline

        # Initialize logging
        self.logger = logging.getLogger(self.__class__.__name__)
        self.logger.setLevel(logging.DEBUG if self.debug else logging.INFO)
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s [%(levelname)s] %(message)s')
        handler.setFormatter(formatter)
        if not self.logger.handlers:
            self.logger.addHandler(handler)
            
        # feature_reasons was initialized above; add an 'all_numericals' entry for clustering
        if self.model_category == 'clustering':
            self.feature_reasons['all_numericals'] = ''

    def get_debug_flag(self, flag_name: str) -> bool:
        """
        Retrieve the value of a specific debug flag from the options.
        Args:
            flag_name (str): The name of the debug flag.
        Returns:
            bool: The value of the debug flag.
        """
        return self.options.get(flag_name, False)

    def _log(self, message: str, step: str, level: str = 'info'):
        """
        Internal method to log messages based on the step-specific debug flags.
        
        Args:
            message (str): The message to log.
            step (str): The preprocessing step name.
            level (str): The logging level ('info', 'debug', etc.).
        """
        debug_flag = self.get_debug_flag(f'debug_{step}')
        if debug_flag:
            if level == 'debug':
                self.logger.debug(message)
            elif level == 'info':
                self.logger.info(message)
            elif level == 'warning':
                self.logger.warning(message)
            elif level == 'error':
                self.logger.error(message)

    def map_model_type_to_category(self) -> str:
        """
        Map the model_type string to a predefined category based on keywords.

        Returns:
            str: The model category ('classification', 'regression', 'clustering', etc.).
        """
        classification_keywords = ['classifier', 'classification', 'logistic', 'svm', 'support vector machine', 'knn', 'neural network']
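        # Note: 'knn' and 'neural network' also appear in classification_keywords, which is
        # checked first, so model types containing them always map to 'classification'.
        regression_keywords = ['regressor', 'regression', 'linear', 'knn', 'neural network']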
        clustering_keywords = ['k-means', 'clustering', 'dbscan', 'kmodes', 'kprototypes']

        model_type_lower = self.model_type.lower()

        for keyword in classification_keywords:
            if keyword in model_type_lower:
                return 'classification'

        for keyword in regression_keywords:
            if keyword in model_type_lower:
                return 'regression'

        for keyword in clustering_keywords:
            if keyword in model_type_lower:
                return 'clustering'

        return 'unknown'

    def filter_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        step_name = "filter_columns"
        self.logger.info(f"Step: {step_name}")

        # Combine all feature lists from configuration
        desired_features = self.numericals + self.ordinal_categoricals + self.nominal_categoricals

        # For time series models, ensure the time column is included
        if self.model_category == 'time_series' and self.time_column:
            if self.time_column not in df.columns:
                self.logger.error(f"Time column '{self.time_column}' not found in input data.")
                raise ValueError(f"Time column '{self.time_column}' not found in the input data.")
            # Add the time column if it is not already part of the feature lists
            if self.time_column not in desired_features:
                desired_features.append(self.time_column)

        # Debug log: report target variable info
        self.logger.debug(f"y_variable provided: {self.y_variable}")
        if self.y_variable and all(col in df.columns for col in self.y_variable):
            self.logger.debug(f"Unique values in target column(s): {df[self.y_variable].drop_duplicates().to_dict()}")

        # For 'train' mode, ensure the target variable is present and excluded from features
        if self.mode == 'train':
            missing_y = [col for col in self.y_variable if col not in df.columns]
            if missing_y:
                self.logger.error(f"Target variable(s) {missing_y} not found in the input data.")
                raise ValueError(f"Target variable(s) {missing_y} not found in the input data.")
            # Exclude y_variable from features (if present)
            desired_features = [col for col in desired_features if col not in self.y_variable]

        # Check that all desired features are present BEFORE indexing the DataFrame, so a
        # missing column raises the descriptive ValueError below rather than a bare KeyError.
        missing_features = [col for col in desired_features if col not in df.columns]
        if missing_features:
            self.logger.error(f"The following required features are missing in the input data: {missing_features}")
            raise ValueError(f"The following required features are missing in the input data: {missing_features}")

        if self.mode == 'train':
            # Retain y_variable in the final DataFrame
            filtered_df = df[desired_features + self.y_variable].copy()
        else:
            # For 'predict' and 'clustering' modes, keep only the feature columns
            filtered_df = df[desired_features].copy()

        self.logger.info(f"✅ Filtered DataFrame to include only specified features. Shape: {filtered_df.shape}")
        self.logger.debug(f"Selected Features: {desired_features}")
        if self.mode == 'train':
            self.logger.debug(f"Retained Target Variable(s): {self.y_variable}")

        return filtered_df


    def create_sequences_by_category(self, X: np.ndarray, y: np.ndarray, group_ids: np.ndarray) -> Tuple[Any, Any, np.ndarray]:
        """
        Create sequences by grouping samples based on the categorical column(s) provided in self.sequence_categorical.
        Also returns the grouping key for each sequence.
        
        Args:
            X: Preprocessed feature array, shape (num_samples, num_features)
            y: Target array, shape (num_samples,)
            group_ids: Array of group identifiers. If multidimensional, each row is a grouping key.
        
        Returns:
            X_seq: Array or list of sequences.
            y_seq: Array or list of target sequences.
            group_keys: Array of grouping keys (one per sequence).
        """
        # Convert group_ids to tuple keys if more than one grouping column is provided.
        if group_ids.ndim > 1:
            group_keys_full = np.array([tuple(row) for row in group_ids])
        else:
            group_keys_full = group_ids
        
        unique_groups = np.unique(group_keys_full, axis=0)
        sequences_X = []
        sequences_y = []
        group_keys_list = []
        
        # For debugging, store example sequences before and after alignment.
        example_before = None
        example_after = None
        
        # First, collect sequences per unique group.
        for idx, group in enumerate(unique_groups):
            if group_keys_full.ndim > 1:
                indices = np.where(np.all(group_keys_full == group, axis=1))[0]
            else:
                indices = np.where(group_keys_full == group)[0]
    
            seq_X = X[indices, :]
            seq_y = y[indices]
            sequences_X.append(seq_X)
            sequences_y.append(seq_y)
            group_keys_list.append(group)
            self.logger.debug(f"Group {group} - seq_y shape: {seq_y.shape}")
    
            if self.get_debug_flag("debug_sequence_examples") and idx < 2:
                example_before = seq_X.copy()  # Save the original unaligned sequence
    
        if not self.dynamic_window_adjustment:
            # Determine the maximum sequence length among all groups and use the
            # longest sequence as the DTW reference.
            max_length = max(seq.shape[0] for seq in sequences_X)
            reference_seq = max(sequences_X, key=lambda seq: seq.shape[0])
            self.logger.debug(f"Maximum sequence length determined: {max_length}")
    
        aligned_X = []
        aligned_y = []
    
        for idx, (seq_X, seq_y) in enumerate(zip(sequences_X, sequences_y)):
            current_length = seq_X.shape[0]
            if not self.dynamic_window_adjustment and current_length < max_length:
                pad_width = max_length - current_length
                # Pad only the time axis of y so both 1D (num_samples,) and 2D targets work.
                y_pad_spec = ((0, pad_width),) + ((0, 0),) * (seq_y.ndim - 1)
                if self.use_dtw:
                    self.logger.debug(f"Group {unique_groups[idx]}: applying DTW warping. Original shape: {seq_X.shape}")
                    original_seq = seq_X.copy()
                    # Align against the longest sequence; aligning a sequence with itself
                    # only yields the trivial diagonal path.
                    path = dtw_path(seq_X, reference_seq)
                    seq_X_aligned = warp_sequence(seq_X, path, max_length)
                    # NOTE: y is zero-padded rather than warped, so per-step targets are not
                    # guaranteed to line up with the warped time steps of X.
                    seq_y_aligned = np.pad(seq_y, y_pad_spec, mode='constant', constant_values=0)
                    if self.get_debug_flag("debug_sequence_examples") and idx < 2:
                        example_after = seq_X_aligned.copy()
                        self.logger.debug(f"Group {unique_groups[idx]} DTW Example - Before (first 5 rows): {original_seq[:5]}")
                        self.logger.debug(f"Group {unique_groups[idx]} DTW Example - After (first 5 rows): {seq_X_aligned[:5]}")
                else:
                    self.logger.debug(f"Group {unique_groups[idx]}: applying zero padding. Original shape: {seq_X.shape}")
                    original_seq = seq_X.copy()
                    seq_X_aligned = np.pad(seq_X, ((0, pad_width), (0, 0)), mode='constant', constant_values=0)
                    seq_y_aligned = np.pad(seq_y, y_pad_spec, mode='constant', constant_values=0)
                    if self.get_debug_flag("debug_sequence_examples") and idx < 2:
                        example_after = seq_X_aligned.copy()
                        self.logger.debug(f"Group {unique_groups[idx]} Padding Example - Before (first 5 rows): {original_seq[:5]}")
                        self.logger.debug(f"Group {unique_groups[idx]} Padding Example - After (first 5 rows): {seq_X_aligned[:5]}")
            else:
                # Dynamic window adjustment is enabled, or the sequence is already at max_length: keep it as-is.
                seq_X_aligned = seq_X
                seq_y_aligned = seq_y
                if self.dynamic_window_adjustment:
                    self.logger.debug(f"Group {unique_groups[idx]}: dynamic window adjustment enabled; no padding applied. Sequence shape: {seq_X.shape}")
    
            aligned_X.append(seq_X_aligned)
            aligned_y.append(seq_y_aligned)
    
        if self.get_debug_flag("debug_sequence_examples"):
            self.logger.debug("Example sequence before alignment (first group):")
            self.logger.debug(example_before)
            self.logger.debug("Example sequence after alignment (first group):")
            self.logger.debug(example_after)
    
        # If dynamic window adjustment is enabled, return lists (variable lengths);
        # otherwise, convert to np.array.
        if self.dynamic_window_adjustment:
            X_seq = aligned_X
            y_seq = aligned_y
        else:
            X_seq = np.array(aligned_X)
            y_seq = np.array(aligned_y)
    
        return X_seq, y_seq, np.array(group_keys_list)



    def apply_dtw_alignment(self, sequences: np.ndarray) -> np.ndarray:
        """
        Align a set of sequences using DTW so that all sequences match the reference length.
        
        Args:
            sequences: Array of sequences with shape (num_sequences, seq_length, num_features)
        
        Returns:
            aligned_sequences: Array of DTW-aligned sequences.
        """
        ref = sequences[0]
        target_length = ref.shape[0]
        aligned_sequences = []
        
        for seq in sequences:
            path = dtw_path(seq, ref)
            aligned_seq = warp_sequence(seq, path, target_length)
            aligned_sequences.append(aligned_seq)
        
        return np.array(aligned_sequences)

    def create_sequences(self, X: np.ndarray, y: np.ndarray) -> Tuple[Any, Any]:
        """
        Create overlapping sequences (windows) and their corresponding targets.
        If dynamic_window_adjustment is enabled, sequences are not padded to a fixed max length.
        
        Args:
            X: Preprocessed feature array, shape (num_samples, num_features)
            y: Target array, shape (num_samples,)
        
        Returns:
            X_seq: Array (or list) of sequences.
            y_seq: Array (or list) of target windows.
        """
        X_seq, y_seq = [], []
        for i in range(0, len(X) - self.window_size - self.horizon + 1, self.step_size):
            seq_X = X[i:i+self.window_size]
            seq_y = y[i+self.window_size:i+self.window_size+self.horizon]
            if not self.dynamic_window_adjustment and self.max_sequence_length and seq_X.shape[0] < self.max_sequence_length:
                pad_width = self.max_sequence_length - seq_X.shape[0]
                seq_X = np.pad(seq_X, ((0, pad_width), (0, 0)), mode='constant', constant_values=0)
            X_seq.append(seq_X)
            y_seq.append(seq_y)
    
        if not self.dynamic_window_adjustment:
            X_seq = np.array(X_seq)
            y_seq = np.array(y_seq)
    
        # --- NEW: Squeeze extra dimension in y_seq if present ---
        if isinstance(y_seq, np.ndarray) and y_seq.ndim == 3 and y_seq.shape[-1] == 1:
            y_seq = np.squeeze(y_seq, axis=-1)
            self.logger.debug("Squeezed extra dimension from y_seq to shape: " + str(y_seq.shape))
        # ---------------------------------------------------------
    
        # --- NEW: Check uniformity before applying DTW alignment ---
        if self.use_dtw and len(X_seq) > 0:  # Guard: X_seq[0] would fail when no windows were produced
            if not self.dynamic_window_adjustment and all(seq.shape[0] == X_seq[0].shape[0] for seq in X_seq):
                self.logger.debug("All sequences are already uniform; skipping DTW alignment.")
            else:
                X_seq = self.apply_dtw_alignment(X_seq)
        # -------------------------------------------------------------
    
        return X_seq, y_seq

    def temporal_encode_sequences(self, X_seq: Any, group_keys: np.ndarray) -> Any:
        """
        Append hierarchical one-hot encoding for grouping variables and positional encoding to each sequence.
        This method assumes that for each sequence, the group key (or tuple of keys) is constant.
        
        Args:
            X_seq: A list or array of sequences (each sequence of shape (seq_length, num_features))
            group_keys: Array of group keys for each sequence.
                        If multiple grouping variables are provided, group_keys is 2D.
        
        Returns:
            New sequences with extra features appended (last dimensions increased by [one-hot dims + 1]).
        """
        # Compute and store the unique categories for each grouping variable if not already done.
        if group_keys.ndim == 1:
            group_keys = group_keys.reshape(-1, 1)
        num_group = group_keys.shape[1]
        for i in range(num_group):
            col_name = self.sequence_categorical[i] if isinstance(self.sequence_categorical, list) else self.sequence_categorical
            if col_name not in self.hierarchical_categories or not self.hierarchical_categories[col_name]:
                self.hierarchical_categories[col_name] = sorted(np.unique(group_keys[:, i]))
                self.logger.debug(f"Hierarchical categories for '{col_name}': {self.hierarchical_categories[col_name]}")
    
        encoded_sequences = []
        # Loop over each sequence
        for idx, seq in enumerate(X_seq):
            seq_length = seq.shape[0]
            # Positional encoding: normalized time indices as a single column.
            pos_encoding = np.linspace(0, 1, seq_length).reshape(-1, 1)
    
            # Hierarchical one-hot encoding for each grouping variable.
            if group_keys.shape[1] == 1:
                group_value = group_keys[idx, 0]
                col_name = self.sequence_categorical[0] if isinstance(self.sequence_categorical, list) else self.sequence_categorical
                categories = self.hierarchical_categories[col_name]
                one_hot = np.zeros((seq_length, len(categories)))
                if group_value in categories:
                    one_hot[:, categories.index(group_value)] = 1
                else:
                    self.logger.warning(f"Group key {group_value} not found in categories for '{col_name}'.")
            else:
                one_hot_list = []
                for i in range(group_keys.shape[1]):
                    col_name = self.sequence_categorical[i] if isinstance(self.sequence_categorical, list) else self.sequence_categorical
                    categories = self.hierarchical_categories[col_name]
                    group_value = group_keys[idx, i]
                    one_hot_col = np.zeros((seq_length, len(categories)))
                    if group_value in categories:
                        one_hot_col[:, categories.index(group_value)] = 1
                    else:
                        self.logger.warning(f"Group value {group_value} not found in categories for '{col_name}'.")
                    one_hot_list.append(one_hot_col)
                one_hot = np.concatenate(one_hot_list, axis=1)
    
            # Concatenate the original features, one-hot encoding, and positional encoding.
            seq_encoded = np.concatenate([seq, one_hot, pos_encoding], axis=1)
            encoded_sequences.append(seq_encoded)
    
        if not self.dynamic_window_adjustment:
            encoded_sequences = np.array(encoded_sequences)
        return encoded_sequences



    def split_dataset(
        self,
        X: pd.DataFrame,
        y: Optional[pd.Series] = None
    ) -> Tuple[pd.DataFrame, Optional[pd.DataFrame], Optional[pd.Series], Optional[pd.Series]]:
        """
        Split the dataset into training and testing sets while retaining original indices.

        Args:
            X (pd.DataFrame): Features.
            y (Optional[pd.Series]): Target variable.

        Returns:
            Tuple[pd.DataFrame, Optional[pd.DataFrame], Optional[pd.Series], Optional[pd.Series]]: X_train, X_test, y_train, y_test
        """
        step_name = "split_dataset"
        self.logger.info("Step: Split Dataset into Train and Test")

        # Debugging Statements
        self._log(f"Before Split - X shape: {X.shape}", step_name, 'debug')
        if y is not None:
            self._log(f"Before Split - y shape: {y.shape}", step_name, 'debug')
        else:
            self._log("Before Split - y is None", step_name, 'debug')

        # Determine splitting based on mode
        if self.mode == 'train' and self.model_category in ['classification', 'regression']:
            if self.model_category == 'classification':
                stratify = y if self.options.get('split_dataset', {}).get('stratify_for_classification', False) else None
                test_size = self.options.get('split_dataset', {}).get('test_size', 0.2)
                random_state = self.options.get('split_dataset', {}).get('random_state', 42)
                X_train, X_test, y_train, y_test = train_test_split(
                    X, y, 
                    test_size=test_size,
                    stratify=stratify, 
                    random_state=random_state
                )
                split_kind = 'stratified' if stratify is not None else 'random'
                self._log(f"Performed {split_kind} split for classification.", step_name, 'debug')
            elif self.model_category == 'regression':
                X_train, X_test, y_train, y_test = train_test_split(
                    X, y, 
                    test_size=self.options.get('split_dataset', {}).get('test_size', 0.2),
                    random_state=self.options.get('split_dataset', {}).get('random_state', 42)
                )
                self._log("Performed random split for regression.", step_name, 'debug')
        else:
            # For 'predict' and 'clustering' modes or other categories
            X_train = X.copy()
            X_test = None
            y_train = y.copy() if y is not None else None
            y_test = None
            self.logger.info(f"No splitting performed for mode '{self.mode}' or model category '{self.model_category}'.")

        self.preprocessing_steps.append("Split Dataset into Train and Test")

        # Keep Indices Aligned Through Each Step
        if X_test is not None and y_test is not None:
            # Sort both X_test and y_test by index
            X_test = X_test.sort_index()
            y_test = y_test.sort_index()
            self.logger.debug("Sorted X_test and y_test by index for alignment.")

        # Debugging: Log post-split shapes and index alignment
        self._log(f"After Split - X_train shape: {X_train.shape}, X_test shape: {X_test.shape if X_test is not None else 'N/A'}", step_name, 'debug')
        if self.model_category == 'classification' and y_train is not None and y_test is not None:
            self.logger.debug(f"Class distribution in y_train:\n{y_train.value_counts(normalize=True)}")
            self.logger.debug(f"Class distribution in y_test:\n{y_test.value_counts(normalize=True)}")
        elif self.model_category == 'regression' and y_train is not None and y_test is not None:
            self.logger.debug(f"y_train statistics:\n{y_train.describe()}")
            self.logger.debug(f"y_test statistics:\n{y_test.describe()}")

        # Check index alignment (only meaningful when a target is provided)
        if y_train is not None:
            if X_train.index.equals(y_train.index):
                self.logger.debug("X_train and y_train indices are aligned.")
            else:
                self.logger.warning("X_train and y_train indices are misaligned.")

        if X_test is not None and y_test is not None and X_test.index.equals(y_test.index):
            self.logger.debug("X_test and y_test indices are aligned.")
        elif X_test is not None and y_test is not None:
            self.logger.warning("X_test and y_test indices are misaligned.")

        return X_train, X_test, y_train, y_test

    def handle_missing_values(self, X_train: pd.DataFrame, X_test: Optional[pd.DataFrame] = None) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Handle missing values for numerical and categorical features based on user options.
        """
        step_name = "handle_missing_values"
        self.logger.info("Step: Handle Missing Values")

        # Fetch user-defined imputation options or set defaults
        impute_options = self.options.get('handle_missing_values', {})
        numerical_strategy = impute_options.get('numerical_strategy', {})
        categorical_strategy = impute_options.get('categorical_strategy', {})

        # Numerical Imputation
        numerical_imputer = None
        new_columns = []
        if self.numericals:
            # Median is the default for every model category (as per preprocessor_config.yaml)
            default_num_strategy = 'median'
            num_strategy = numerical_strategy.get('strategy', default_num_strategy)
            num_imputer_type = numerical_strategy.get('imputer', 'SimpleImputer')  # Can be 'SimpleImputer', 'KNNImputer', etc.

            self._log(f"Numerical Imputation Strategy: {num_strategy.capitalize()}, Imputer Type: {num_imputer_type}", step_name, 'debug')

            # Initialize numerical imputer based on user option
            if num_imputer_type == 'SimpleImputer':
                numerical_imputer = SimpleImputer(strategy=num_strategy)
            elif num_imputer_type == 'KNNImputer':
                knn_neighbors = numerical_strategy.get('knn_neighbors', 5)
                numerical_imputer = KNNImputer(n_neighbors=knn_neighbors)
            else:
                self.logger.error(f"Numerical imputer type '{num_imputer_type}' is not supported.")
                raise ValueError(f"Numerical imputer type '{num_imputer_type}' is not supported.")

            # Fit and transform ONLY on X_train
            X_train[self.numericals] = numerical_imputer.fit_transform(X_train[self.numericals])
            self.numerical_imputer = numerical_imputer  # Assign to self for saving
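            # Record the method actually used; KNNImputer ignores the 'strategy' option
            num_reason = f'KNN (k={numerical_imputer.n_neighbors})' if num_imputer_type == 'KNNImputer' else num_strategy.capitalize()
            self.feature_reasons.update({col: self.feature_reasons.get(col, '') + f'Numerical: {num_reason} Imputation | ' for col in self.numericals})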
            new_columns.extend(self.numericals)

            if X_test is not None:
                # Transform ONLY on X_test without fitting
                X_test[self.numericals] = numerical_imputer.transform(X_test[self.numericals])

        # Categorical Imputation
        categorical_imputer = None
        all_categoricals = self.ordinal_categoricals + self.nominal_categoricals
        if all_categoricals:
            default_cat_strategy = 'most_frequent'
            cat_strategy = categorical_strategy.get('strategy', default_cat_strategy)
            cat_imputer_type = categorical_strategy.get('imputer', 'SimpleImputer')

            self._log(f"Categorical Imputation Strategy: {cat_strategy.capitalize()}, Imputer Type: {cat_imputer_type}", step_name, 'debug')

            # Initialize categorical imputer based on user option
            if cat_imputer_type == 'SimpleImputer':
                categorical_imputer = SimpleImputer(strategy=cat_strategy)
            elif cat_imputer_type == 'ConstantImputer':
                fill_value = categorical_strategy.get('fill_value', 'Missing')
                categorical_imputer = SimpleImputer(strategy='constant', fill_value=fill_value)
            else:
                self.logger.error(f"Categorical imputer type '{cat_imputer_type}' is not supported.")
                raise ValueError(f"Categorical imputer type '{cat_imputer_type}' is not supported.")

            # Fit and transform ONLY on X_train
            X_train[all_categoricals] = categorical_imputer.fit_transform(X_train[all_categoricals])
            self.categorical_imputer = categorical_imputer  # Assign to self for saving
            self.feature_reasons.update({
                col: self.feature_reasons.get(col, '') + (f'Categorical: Constant Imputation (Value={categorical_strategy.get("fill_value", "Missing")}) | ' if cat_imputer_type == 'ConstantImputer' else f'Categorical: {cat_strategy.capitalize()} Imputation | ')
                for col in all_categoricals
            })
            new_columns.extend(all_categoricals)

            if X_test is not None:
                # Transform ONLY on X_test without fitting
                X_test[all_categoricals] = categorical_imputer.transform(X_test[all_categoricals])

        self.preprocessing_steps.append("Handle Missing Values")

        # Debugging: Log post-imputation shapes and missing values
        self._log(f"Completed: Handle Missing Values. Dataset shape after imputation: {X_train.shape}", step_name, 'debug')
        self._log(f"Missing values after imputation in X_train:\n{X_train.isnull().sum()}", step_name, 'debug')
        self._log(f"New columns handled: {new_columns}", step_name, 'debug')

        return X_train, X_test

    def handle_outliers(self, X_train: pd.DataFrame, y_train: Optional[pd.Series] = None) -> Tuple[pd.DataFrame, Optional[pd.Series]]:
        """
        Handle outliers based on the model's sensitivity and user options.
        For time_series models, apply a custom outlier handling using a rolling median filter
        to replace extreme values rather than dropping rows (to preserve temporal alignment).

        Args:
            X_train (pd.DataFrame): Training features.
            y_train (pd.Series, optional): Training target.

        Returns:
            tuple: X_train with outliers handled and corresponding y_train.
        """
        step_name = "handle_outliers"
        self.logger.info("Step: Handle Outliers")
        self._log("Starting outlier handling.", step_name, 'debug')
        debug_flag = self.get_debug_flag('debug_handle_outliers')
        initial_shape = X_train.shape[0]
        outlier_options = self.options.get('handle_outliers', {})
        zscore_threshold = outlier_options.get('zscore_threshold', 3)
        iqr_multiplier = outlier_options.get('iqr_multiplier', 1.5)
        isolation_contamination = outlier_options.get('isolation_contamination', 0.05)

        # ----- NEW: Custom outlier handling branch for time series -----
        if self.model_category == 'time_series':
            self.logger.info("Applying custom outlier handling for time_series using rolling median filter.")
            # For time series, do not drop rows—instead, replace outliers with the rolling median.
            for col in self.numericals:
                # Compute rolling statistics with a window of 5 (centered)
                rolling_median = X_train[col].rolling(window=5, center=True, min_periods=1).median()
                rolling_q1 = X_train[col].rolling(window=5, center=True, min_periods=1).quantile(0.25)
                rolling_q3 = X_train[col].rolling(window=5, center=True, min_periods=1).quantile(0.75)
                rolling_iqr = rolling_q3 - rolling_q1
                # Identify outliers as those deviating more than the multiplier times the rolling IQR
                outlier_mask = abs(X_train[col] - rolling_median) > (iqr_multiplier * rolling_iqr)
                num_outliers = outlier_mask.sum()
                # Replace outlier values with the corresponding rolling median
                X_train.loc[outlier_mask, col] = rolling_median[outlier_mask]
                self.logger.debug(f"Replaced {num_outliers} outliers in column '{col}' with rolling median.")
            self.preprocessing_steps.append("Handle Outliers (time_series custom)")
            self._log(f"Completed: Handle Outliers for time_series. Initial samples: {initial_shape}, Final samples: {X_train.shape[0]}", step_name, 'debug')
            return X_train, y_train
        # -----------------------------------------------------------------

        # Existing outlier handling for regression and classification
        if self.model_category in ['regression', 'classification']:
            self.logger.info(f"Applying univariate outlier detection for {self.model_category}.")
            for col in self.numericals:
                # Z-Score Filtering
                apply_zscore = outlier_options.get('apply_zscore', True)
                if apply_zscore:
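                    col_std = X_train[col].std()
                    if col_std > 0:
                        z_scores = np.abs((X_train[col] - X_train[col].mean()) / col_std)
                        mask_z = z_scores < zscore_threshold
                    else:
                        # Constant (or single-row) column: z-scores are undefined, so keep every row
                        mask_z = pd.Series(True, index=X_train.index)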
                    removed_z = (~mask_z).sum()
                    X_train = X_train[mask_z]
                    if y_train is not None:
                        y_train = y_train.loc[X_train.index]
                    self.feature_reasons[col] += f'Outliers handled with Z-Score Filtering (threshold={zscore_threshold}) | '
                    self._log(f"Removed {removed_z} outliers from '{col}' using Z-Score Filtering.", step_name, 'debug')

                # IQR Filtering
                apply_iqr = outlier_options.get('apply_iqr', True)
                if apply_iqr:
                    Q1 = X_train[col].quantile(0.25)
                    Q3 = X_train[col].quantile(0.75)
                    IQR = Q3 - Q1
                    lower_bound = Q1 - iqr_multiplier * IQR
                    upper_bound = Q3 + iqr_multiplier * IQR
                    mask_iqr = (X_train[col] >= lower_bound) & (X_train[col] <= upper_bound)
                    removed_iqr = (~mask_iqr).sum()
                    X_train = X_train[mask_iqr]
                    if y_train is not None:
                        y_train = y_train.loc[X_train.index]
                    self.feature_reasons[col] += f'Outliers handled with IQR Filtering (multiplier={iqr_multiplier}) | '
                    self._log(f"Removed {removed_iqr} outliers from '{col}' using IQR Filtering.", step_name, 'debug')

        elif self.model_category == 'clustering':
            self.logger.info("Applying multivariate IsolationForest for clustering.")
            contamination = isolation_contamination
            iso_forest = IsolationForest(contamination=contamination, random_state=42)
            preds = iso_forest.fit_predict(X_train[self.numericals])
            mask_iso = preds != -1
            removed_iso = (preds == -1).sum()
            X_train = X_train[mask_iso]
            if y_train is not None:
                y_train = y_train.loc[X_train.index]
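            # Use .get so a missing 'all_numericals' entry does not raise KeyError
            self.feature_reasons['all_numericals'] = self.feature_reasons.get('all_numericals', '') + f'Outliers handled with Multivariate IsolationForest (contamination={contamination}) | '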
            self._log(f"Removed {removed_iso} outliers using Multivariate IsolationForest.", step_name, 'debug')
        else:
            self.logger.warning(f"Model category '{self.model_category}' not recognized for outlier handling.")

        self.preprocessing_steps.append("Handle Outliers")
        self._log(f"Completed: Handle Outliers. Initial samples: {initial_shape}, Final samples: {X_train.shape[0]}", step_name, 'debug')
        self._log(f"Missing values after outlier handling in X_train:\n{X_train.isnull().sum()}", step_name, 'debug')
        return X_train, y_train


    def test_normality(self, X_train: pd.DataFrame) -> Dict[str, Dict]:
        """
        Test normality for numerical features based on normality tests and user options.

        Args:
            X_train (pd.DataFrame): Training features.

        Returns:
            Dict[str, Dict]: Dictionary with normality test results for each numerical feature.
        """
        step_name = "Test for Normality"
        self.logger.info(f"Step: {step_name}")
        debug_flag = self.get_debug_flag('debug_test_normality')
        normality_results = {}

        # Fetch user-defined normality test options or set defaults
        normality_options = self.options.get('test_normality', {})
        p_value_threshold = normality_options.get('p_value_threshold', 0.05)
        skewness_threshold = normality_options.get('skewness_threshold', 1.0)
        additional_tests = normality_options.get('additional_tests', [])  # e.g., ['anderson-darling']

        for col in self.numericals:
            data = X_train[col].dropna()
            skewness = data.skew()
            kurtosis = data.kurtosis()

            # Determine which normality test to use based on sample size and user options
            test_used = 'Shapiro-Wilk'
            p_value = 0.0

            if len(data) <= 5000:
                from scipy.stats import shapiro
                stat, p_val = shapiro(data)
                test_used = 'Shapiro-Wilk'
                p_value = p_val
            else:
                from scipy.stats import anderson
                result = anderson(data)
                test_used = 'Anderson-Darling'
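                # Approximate a p-value from the critical-value table: take the largest
                # significance level at which normality is not rejected (a lower bound on p)
                p_value = 0.0  # Rejected at every tabulated level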
                for cv, sig in zip(result.critical_values, result.significance_level):
                    if result.statistic < cv:
                        p_value = sig / 100
                        break

            # Apply user-defined or default criteria
            if self.model_category in ['regression', 'classification', 'clustering']:
                # Linear, Logistic Regression, and Clustering: Use p-value and skewness
                needs_transform = (p_value < p_value_threshold) or (abs(skewness) > skewness_threshold)
            else:
                # Other models: Use skewness, and optionally p-values based on options
                use_p_value = normality_options.get('use_p_value_other_models', False)
                if use_p_value:
                    needs_transform = (p_value < p_value_threshold) or (abs(skewness) > skewness_threshold)
                else:
                    needs_transform = abs(skewness) > skewness_threshold

            normality_results[col] = {
                'skewness': skewness,
                'kurtosis': kurtosis,
                'p_value': p_value,
                'test_used': test_used,
                'needs_transform': needs_transform
            }

            # Conditional Detailed Logging
            if debug_flag:
                self._log(f"Feature '{col}': p-value={p_value:.4f}, skewness={skewness:.4f}, needs_transform={needs_transform}", step_name, 'debug')

        self.normality_results = normality_results
        self.preprocessing_steps.append(step_name)

        # Completion Logging
        if debug_flag:
            self._log(f"Completed: {step_name}. Normality results computed.", step_name, 'debug')
        else:
            self.logger.info(f"Step '{step_name}' completed: Normality results computed.")

        return normality_results

    def encode_categorical_variables(self, X_train: pd.DataFrame, X_test: Optional[pd.DataFrame] = None) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Encode categorical variables using user-specified encoding strategies.
        """
        step_name = "encode_categorical_variables"
        self.logger.info("Step: Encode Categorical Variables")
        self._log("Starting categorical variable encoding.", step_name, 'debug')

        # Fetch user-defined encoding options or set defaults
        encoding_options = self.options.get('encode_categoricals', {})
        ordinal_encoding = encoding_options.get('ordinal_encoding', 'OrdinalEncoder')  # Options: 'OrdinalEncoder', 'None'
        nominal_encoding = encoding_options.get('nominal_encoding', 'OrdinalEncoder')  # Options: 'OrdinalEncoder', 'FrequencyEncoder', 'None' ('OneHotEncoder' is not handled by this method)
        handle_unknown = encoding_options.get('handle_unknown', 'use_encoded_value')  # Adjusted for OrdinalEncoder

        # Determine if SMOTENC is being used
        smote_variant = self.options.get('implement_smote', {}).get('variant', None)
        if smote_variant == 'SMOTENC':
            nominal_encoding = 'OrdinalEncoder'  # Ensure compatibility

        transformers = []
        new_columns = []
        if self.ordinal_categoricals and ordinal_encoding != 'None':
            if ordinal_encoding == 'OrdinalEncoder':
                transformers.append(
                    ('ordinal', OrdinalEncoder(), self.ordinal_categoricals)
                )
                self._log(f"Added OrdinalEncoder for features: {self.ordinal_categoricals}", step_name, 'debug')
            else:
                self.logger.error(f"Ordinal encoding method '{ordinal_encoding}' is not supported.")
                raise ValueError(f"Ordinal encoding method '{ordinal_encoding}' is not supported.")
        if self.nominal_categoricals and nominal_encoding != 'None':
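            if nominal_encoding == 'OrdinalEncoder':
                # 'use_encoded_value' requires an explicit unknown_value; -1 marks unseen categories
                encoder_kwargs = {'handle_unknown': handle_unknown}
                if handle_unknown == 'use_encoded_value':
                    encoder_kwargs['unknown_value'] = -1
                transformers.append(
                    ('nominal', OrdinalEncoder(**encoder_kwargs), self.nominal_categoricals)
                )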
                self._log(f"Added OrdinalEncoder for features: {self.nominal_categoricals}", step_name, 'debug')
            elif nominal_encoding == 'FrequencyEncoder':
                # Custom Frequency Encoding
                for col in self.nominal_categoricals:
                    freq = X_train[col].value_counts(normalize=True)
                    X_train[col] = X_train[col].map(freq)
                    if X_test is not None:
                        X_test[col] = X_test[col].map(freq).fillna(0)
                    self.feature_reasons[col] += 'Encoded with Frequency Encoding | '
                    self._log(f"Applied Frequency Encoding to '{col}'.", step_name, 'debug')
            else:
                self.logger.error(f"Nominal encoding method '{nominal_encoding}' is not supported.")
                raise ValueError(f"Nominal encoding method '{nominal_encoding}' is not supported.")

        if not transformers and 'FrequencyEncoder' not in nominal_encoding:
            self.logger.info("No categorical variables to encode.")
            self.preprocessing_steps.append("Encode Categorical Variables")
            self._log(f"Completed: Encode Categorical Variables. No encoding was applied.", step_name, 'debug')
            return X_train, X_test

        if transformers:
            self.preprocessor = ColumnTransformer(
                transformers=transformers,
                remainder='passthrough',
                verbose_feature_names_out=False  # Disable prefixing
            )

            # Fit and transform training data
            X_train_encoded = self.preprocessor.fit_transform(X_train)
            self._log("Fitted and transformed X_train with ColumnTransformer.", step_name, 'debug')

            # Transform testing data
            if X_test is not None:
                X_test_encoded = self.preprocessor.transform(X_test)
                self._log("Transformed X_test with fitted ColumnTransformer.", step_name, 'debug')
            else:
                X_test_encoded = None

            # Retrieve feature names after encoding. ColumnTransformer emits the
            # transformed columns first (in transformer order), followed by the
            # remainder columns in their original order. Frequency-encoded columns
            # are not in any transformer, so they stay with the remainder.
            encoded_feature_names = []
            for _, _, cols in transformers:
                encoded_feature_names += list(cols)
            passthrough_features = [col for col in X_train.columns if col not in encoded_feature_names]
            encoded_feature_names += passthrough_features
            new_columns.extend(encoded_feature_names)

            # Convert numpy arrays back to DataFrames
            X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=encoded_feature_names, index=X_train.index)
            if X_test_encoded is not None:
                X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=encoded_feature_names, index=X_test.index)
            else:
                X_test_encoded_df = None

            # Store encoders for inverse transformation
            self.ordinal_encoder = self.preprocessor.named_transformers_.get('ordinal', None)
            self.nominal_encoder = self.preprocessor.named_transformers_.get('nominal', None)

            self.preprocessing_steps.append("Encode Categorical Variables")
            self._log(f"Completed: Encode Categorical Variables. X_train_encoded shape: {X_train_encoded_df.shape}", step_name, 'debug')
            self._log(f"Columns after encoding: {encoded_feature_names}", step_name, 'debug')
            self._log(f"Sample of encoded X_train:\n{X_train_encoded_df.head()}", step_name, 'debug')
            self._log(f"New columns added: {new_columns}", step_name, 'debug')

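            return X_train_encoded_df, X_test_encoded_df

        # Only frequency encoding was applied (in place above), so no ColumnTransformer was fitted
        self.preprocessing_steps.append("Encode Categorical Variables")
        self._log("Completed: Encode Categorical Variables. Frequency encoding applied in place.", step_name, 'debug')
        return X_train, X_test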

    def generate_recommendations(self) -> pd.DataFrame:
        """
        Generate a table of preprocessing recommendations based on the model type, data, and user options.

        Returns:
            pd.DataFrame: DataFrame containing recommendations for each feature.
        """
        step_name = "Generate Preprocessor Recommendations"
        self.logger.info(f"Step: {step_name}")
        debug_flag = self.get_debug_flag('debug_generate_recommendations')

        # Generate recommendations based on feature reasons
        recommendations = {}
        for col in self.ordinal_categoricals + self.nominal_categoricals + self.numericals:
            reasons = self.feature_reasons.get(col, '').strip(' | ')
            recommendations[col] = reasons

        recommendations_table = pd.DataFrame.from_dict(
            recommendations, 
            orient='index', 
            columns=['Preprocessing Reason']
        )
        if debug_flag:
            self.logger.debug(f"Preprocessing Recommendations:\n{recommendations_table}")
        else:
            self.logger.info("Preprocessing Recommendations generated.")

        self.preprocessing_steps.append(step_name)

        # Completion Logging
        if debug_flag:
            self._log(f"Completed: {step_name}. Recommendations generated.", step_name, 'debug')
        else:
            self.logger.info(f"Step '{step_name}' completed: Recommendations generated.")

        return recommendations_table

    def save_transformers(self):
        step_name = "Save Transformers"
        self.logger.info(f"Step: {step_name}")
        debug_flag = self.get_debug_flag('debug_save_transformers')
        
        # Ensure the transformers directory exists
        os.makedirs(self.transformers_dir, exist_ok=True)
        transformers_path = os.path.join(self.transformers_dir, 'transformers.pkl')  # Consistent file path
        
        transformers = {
            'numerical_imputer': getattr(self, 'numerical_imputer', None),
            'categorical_imputer': getattr(self, 'categorical_imputer', None),
            'preprocessor': self.pipeline,   # Includes all preprocessing steps
            'smote': self.smote,
            'final_feature_order': self.final_feature_order,
            'categorical_indices': self.categorical_indices
        }
        try:
            joblib.dump(transformers, transformers_path)
            if debug_flag:
                self._log(f"Transformers saved at '{transformers_path}'.", step_name, 'debug')
            else:
                self.logger.info(f"Transformers saved at '{transformers_path}'.")
        except Exception as e:
            self.logger.error(f"❌ Failed to save transformers: {e}")
            raise

        self.preprocessing_steps.append(step_name)

    def load_transformers(self) -> dict:
        step_name = "Load Transformers"
        self.logger.info(f"Step: {step_name}")
        debug_flag = self.get_debug_flag('debug_load_transformers')  # Assuming a step-specific debug flag
        transformers_path = os.path.join(self.transformers_dir, 'transformers.pkl')  # Correct path

        # Debug log
        self.logger.debug(f"Loading transformers from: {transformers_path}")

        if not os.path.exists(transformers_path):
            self.logger.error(f"❌ Transformers file not found at '{transformers_path}'. Cannot proceed with prediction.")
            raise FileNotFoundError(f"Transformers file not found at '{transformers_path}'.")

        try:
            transformers = joblib.load(transformers_path)

            # Extract transformers
            numerical_imputer = transformers.get('numerical_imputer')
            categorical_imputer = transformers.get('categorical_imputer')
            preprocessor = transformers.get('preprocessor')
            smote = transformers.get('smote', None)
            final_feature_order = transformers.get('final_feature_order', [])
            categorical_indices = transformers.get('categorical_indices', [])
            self.categorical_indices = categorical_indices  # Set the attribute

            # Post-loading check: the pipeline must be present to transform new data
            if preprocessor is None:
                self.logger.error("❌ Preprocessor is not loaded.")
                raise AttributeError("Preprocessor is not loaded.")
            self.logger.debug("Pipeline loaded. Ready to transform new data.")

        except Exception as e:
            self.logger.error(f"❌ Failed to load transformers: {e}")
            raise

        self.preprocessing_steps.append(step_name)

        if debug_flag:
            self._log(f"Transformers loaded successfully from '{transformers_path}'.", step_name, 'debug')
        else:
            self.logger.info(f"Transformers loaded successfully from '{transformers_path}'.")

        # Set the pipeline
        self.pipeline = preprocessor

        # Return the transformers as a dictionary
        return {
            'numerical_imputer': numerical_imputer,
            'categorical_imputer': categorical_imputer,
            'preprocessor': preprocessor,
            'smote': smote,
            'final_feature_order': final_feature_order,
            'categorical_indices': categorical_indices
        }

    def apply_scaling(self, X_train: pd.DataFrame, X_test: Optional[pd.DataFrame] = None) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Apply scaling based on the model type and user options.

        Args:
            X_train (pd.DataFrame): Training features.
            X_test (Optional[pd.DataFrame]): Testing features.

        Returns:
            Tuple[pd.DataFrame, Optional[pd.DataFrame]]: Scaled X_train and X_test.
        """
        step_name = "Apply Scaling (If Needed by Model)"
        self.logger.info(f"Step: {step_name}")
        debug_flag = self.get_debug_flag('debug_apply_scaling')

        # Fetch user-defined scaling options or set defaults
        scaling_options = self.options.get('apply_scaling', {})
        scaling_method = scaling_options.get('method', None)  # 'StandardScaler', 'MinMaxScaler', 'RobustScaler', 'None'
        features_to_scale = scaling_options.get('features', self.numericals)

        scaler = None
        scaling_type = 'None'

        if scaling_method is None:
            # Default scaling based on model category
            if self.model_category in ['regression', 'classification', 'clustering']:
                # For clustering, MinMaxScaler is generally preferred
                if self.model_category == 'clustering':
                    scaler = MinMaxScaler()
                    scaling_type = 'MinMaxScaler'
                else:
                    scaler = StandardScaler()
                    scaling_type = 'StandardScaler'
            else:
                scaler = None
                scaling_type = 'None'
        else:
            # Normalize the scaling_method string to handle case-insensitivity
            scaling_method_normalized = scaling_method.lower()
            if scaling_method_normalized == 'standardscaler':
                scaler = StandardScaler()
                scaling_type = 'StandardScaler'
            elif scaling_method_normalized == 'minmaxscaler':
                scaler = MinMaxScaler()
                scaling_type = 'MinMaxScaler'
            elif scaling_method_normalized == 'robustscaler':
                scaler = RobustScaler()
                scaling_type = 'RobustScaler'
            elif scaling_method_normalized == 'none':
                scaler = None
                scaling_type = 'None'
            else:
                self.logger.error(f"Scaling method '{scaling_method}' is not supported.")
                raise ValueError(f"Scaling method '{scaling_method}' is not supported.")

        # Apply scaling if scaler is defined
        if scaler is not None and features_to_scale:
            self.scaler = scaler
            if debug_flag:
                self._log(f"Features to scale: {features_to_scale}", step_name, 'debug')

            # Check if features exist in the dataset
            missing_features = [feat for feat in features_to_scale if feat not in X_train.columns]
            if missing_features:
                self.logger.error(f"The following features specified for scaling are missing in the dataset: {missing_features}")
                raise KeyError(f"The following features specified for scaling are missing in the dataset: {missing_features}")

            X_train[features_to_scale] = scaler.fit_transform(X_train[features_to_scale])
            if X_test is not None:
                X_test[features_to_scale] = scaler.transform(X_test[features_to_scale])

            for col in features_to_scale:
                self.feature_reasons[col] += f'Scaling Applied: {scaling_type} | '

            self.preprocessing_steps.append(step_name)
            if debug_flag:
                self._log(f"Applied {scaling_type} to features: {features_to_scale}", step_name, 'debug')
                if hasattr(scaler, 'mean_'):
                    self._log(f"Scaler Parameters: mean={scaler.mean_}", step_name, 'debug')
                if hasattr(scaler, 'scale_'):
                    self._log(f"Scaler Parameters: scale={scaler.scale_}", step_name, 'debug')
                self._log(f"Sample of scaled X_train:\n{X_train[features_to_scale].head()}", step_name, 'debug')
                if X_test is not None:
                    self._log(f"Sample of scaled X_test:\n{X_test[features_to_scale].head()}", step_name, 'debug')
            else:
                self.logger.info(f"Step '{step_name}' completed: Applied {scaling_type} to features: {features_to_scale}")
        else:
            self.logger.info("No scaling applied based on user options or no features specified.")
            self.preprocessing_steps.append(step_name)
            if debug_flag:
                self._log(f"Completed: {step_name}. No scaling was applied.", step_name, 'debug')
            else:
                self.logger.info(f"Step '{step_name}' completed: No scaling was applied.")

        return X_train, X_test

    def determine_n_neighbors(self, minority_count: int, default_neighbors: int = 5) -> int:
        """
        Determine the appropriate number of neighbors for SMOTE based on minority class size.

        Args:
            minority_count (int): Number of samples in the minority class.
            default_neighbors (int): Default number of neighbors to use if possible.

        Returns:
            int: Determined number of neighbors for SMOTE.
        """
        if minority_count <= 1:
            raise ValueError("SMOTE cannot be applied when the minority class has less than 2 samples.")
        
        # Ensure n_neighbors does not exceed minority_count - 1
        n_neighbors = min(default_neighbors, minority_count - 1)
        return n_neighbors

    def implement_smote(self, X_train: pd.DataFrame, y_train: pd.Series) -> Tuple[pd.DataFrame, pd.Series]:
        """
        Implement SMOTE or its variants based on class imbalance with automated n_neighbors selection.

        Args:
            X_train (pd.DataFrame): Training features (transformed).
            y_train (pd.Series): Training target.

        Returns:
            Tuple[pd.DataFrame, pd.Series]: Resampled X_train and y_train.
        """
        step_name = "Implement SMOTE (Train Only)"
        self.logger.info(f"Step: {step_name}")

        # Check if classification
        if self.model_category != 'classification':
            self.logger.info("SMOTE not applicable: Not a classification model.")
            self.preprocessing_steps.append("SMOTE Skipped")
            return X_train, y_train

        # Calculate class distribution
        class_counts = y_train.value_counts()
        if len(class_counts) < 2:
            self.logger.warning("SMOTE not applicable: Only one class present.")
            self.preprocessing_steps.append("SMOTE Skipped")
            return X_train, y_train

        majority_class = class_counts.idxmax()
        minority_class = class_counts.idxmin()
        majority_count = class_counts.max()
        minority_count = class_counts.min()
        imbalance_ratio = minority_count / majority_count
        self.logger.info(f"Class Distribution before SMOTE: {class_counts.to_dict()}")
        self.logger.info(f"Imbalance Ratio (Minority/Majority): {imbalance_ratio:.4f}")

        # Determine SMOTE variant based on dataset composition
        has_numericals = len(self.numericals) > 0
        has_categoricals = len(self.ordinal_categoricals) + len(self.nominal_categoricals) > 0

        # Automatically select SMOTE variant
        if has_numericals and has_categoricals:
            smote_variant = 'SMOTENC'
            self.logger.info("Dataset contains both numerical and categorical features. Using SMOTENC.")
        elif has_numericals and not has_categoricals:
            smote_variant = 'SMOTE'
            self.logger.info("Dataset contains only numerical features. Using SMOTE.")
        elif has_categoricals and not has_numericals:
            smote_variant = 'SMOTEN'
            self.logger.info("Dataset contains only categorical features. Using SMOTEN.")
        else:
            smote_variant = 'SMOTE'  # Fallback
            self.logger.info("Feature composition unclear. Using SMOTE as default.")

        # Initialize SMOTE based on the variant
        try:
            if smote_variant == 'SMOTENC':
                if not self.categorical_indices:
                    # Determine categorical indices if not already set.
                    # Offsets follow the transformer order of the fitted ColumnTransformer.
                    categorical_features = []
                    offset = 0
                    for name, transformer, features in self.pipeline.transformers_:
                        if name == 'remainder':
                            continue
                        width = len(features)
                        if ('ord' in name or 'nominal' in name) and isinstance(transformer, Pipeline):
                            encoder = transformer.named_steps.get('ordinal_encoder') or transformer.named_steps.get('onehot_encoder')
                            if hasattr(encoder, 'categories_'):
                                if isinstance(encoder, OneHotEncoder):
                                    # One-hot output has one column per category (assumes no `drop`)
                                    width = sum(len(cats) for cats in encoder.categories_)
                                categorical_features.extend(range(offset, offset + width))
                        offset += width
                    self.categorical_indices = categorical_features
                    self.logger.debug(f"Categorical feature indices for SMOTENC: {self.categorical_indices}")
                n_neighbors = self.determine_n_neighbors(minority_count, default_neighbors=5)
                smote = SMOTENC(categorical_features=self.categorical_indices, random_state=42, k_neighbors=n_neighbors)
                self.logger.debug(f"Initialized SMOTENC with categorical features indices: {self.categorical_indices} and n_neighbors={n_neighbors}")
            elif smote_variant == 'SMOTEN':
                n_neighbors = self.determine_n_neighbors(minority_count, default_neighbors=5)
                smote = SMOTEN(random_state=42, k_neighbors=n_neighbors)  # SMOTEN takes k_neighbors, like SMOTE and SMOTENC
                self.logger.debug(f"Initialized SMOTEN with n_neighbors={n_neighbors}")
            else:
                n_neighbors = self.determine_n_neighbors(minority_count, default_neighbors=5)
                smote = SMOTE(random_state=42, k_neighbors=n_neighbors)
                self.logger.debug(f"Initialized SMOTE with n_neighbors={n_neighbors}")
        except ValueError as ve:
            self.logger.error(f"❌ SMOTE initialization failed: {ve}")
            raise
        except Exception as e:
            self.logger.error(f"❌ Unexpected error during SMOTE initialization: {e}")
            raise

        # Apply SMOTE
        try:
            X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
            self.logger.info(f"Applied {smote_variant}. Resampled dataset shape: {X_resampled.shape}")
            self.preprocessing_steps.append("Implement SMOTE")
            self.smote = smote  # Assign to self for saving
            self.logger.debug(f"Selected n_neighbors for SMOTE: {n_neighbors}")
            return X_resampled, y_resampled
        except Exception as e:
            self.logger.error(f"❌ SMOTE application failed: {e}")
            raise

    def inverse_transform_data(self, X_transformed: np.ndarray, original_data: Optional[pd.DataFrame] = None) -> pd.DataFrame:
        """
        Perform inverse transformation on the transformed data to reconstruct original feature values.

        Args:
            X_transformed (np.ndarray): The transformed feature data.
            original_data (Optional[pd.DataFrame]): The original data before transformation.

        Returns:
            pd.DataFrame: The inverse-transformed DataFrame including passthrough columns.
        """
        if self.pipeline is None:
            self.logger.error("Preprocessing pipeline has not been fitted. Cannot perform inverse transformation.")
            raise AttributeError("Preprocessing pipeline has not been fitted. Cannot perform inverse transformation.")

        preprocessor = self.pipeline
        logger = logging.getLogger('InverseTransform')
        if self.debug or self.get_debug_flag('debug_final_inverse_transformations'):
            logger.setLevel(logging.DEBUG)
        else:
            logger.setLevel(logging.INFO)

        logger.debug(f"[DEBUG Inverse] Starting inverse transformation. Input shape: {X_transformed.shape}")

        # Initialize variables
        inverse_data = {}
        transformations_applied = False  # Flag to check if any transformations are applied
        start_idx = 0  # Starting index for slicing

        # Iterate over each transformer in the ColumnTransformer
        for name, transformer, features in preprocessor.transformers_:
            if name == 'remainder':
                logger.debug(f"[DEBUG Inverse] Skipping 'remainder' transformer (passthrough columns).")
                continue  # Skip passthrough columns

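            # One-hot encoders emit one column per category, so the slice can be wider than len(features)
            n_out = len(features)
            if hasattr(transformer, 'named_steps'):
                final_step = transformer.steps[-1][1]
                if isinstance(final_step, OneHotEncoder):
                    n_out = sum(len(cats) for cats in final_step.categories_)
            end_idx = start_idx + n_out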
            logger.debug(f"[DEBUG Inverse] Transformer '{name}' handling features {features} with slice {start_idx}:{end_idx}")

            # Check if the transformer has an inverse_transform method
            if hasattr(transformer, 'named_steps'):
                # Access the last step in the pipeline (e.g., scaler or encoder)
                last_step = list(transformer.named_steps.keys())[-1]
                inverse_transformer = transformer.named_steps[last_step]

                if hasattr(inverse_transformer, 'inverse_transform'):
                    transformed_slice = X_transformed[:, start_idx:end_idx]
                    inverse_slice = inverse_transformer.inverse_transform(transformed_slice)

                    # Assign inverse-transformed data to the corresponding feature names
                    for idx, feature in enumerate(features):
                        inverse_data[feature] = inverse_slice[:, idx]

                    logger.debug(f"[DEBUG Inverse] Applied inverse_transform on transformer '{last_step}' for features {features}.")
                    transformations_applied = True
                else:
                    logger.debug(f"[DEBUG Inverse] Transformer '{last_step}' does not support inverse_transform. Skipping.")
            else:
                logger.debug(f"[DEBUG Inverse] Transformer '{name}' does not have 'named_steps'. Skipping.")

            start_idx = end_idx  # Update starting index for next transformer

        # Convert the inverse_data dictionary to a DataFrame
        if transformations_applied:
            inverse_df = pd.DataFrame(inverse_data, index=original_data.index if original_data is not None else None)
            logger.debug(f"[DEBUG Inverse] Inverse DataFrame shape (transformed columns): {inverse_df.shape}")
            logger.debug(f"[DEBUG Inverse] Sample of inverse-transformed data:\n{inverse_df.head()}")
        else:
            if original_data is not None:
                logger.warning("⚠️ No reversible transformations were applied. Returning original data.")
                inverse_df = original_data.copy()
                logger.debug(f"[DEBUG Inverse] Returning a copy of original_data with shape: {inverse_df.shape}")
            else:
                logger.error("❌ No transformations were applied and original_data was not provided. Cannot perform inverse transformation.")
                raise ValueError("No transformations were applied and original_data was not provided.")

        # Identify passthrough columns by excluding transformed features
        if original_data is not None and transformations_applied:
            transformed_features = set(inverse_data.keys())
            all_original_features = set(original_data.columns)
            passthrough_columns = list(all_original_features - transformed_features)
            logger.debug(f"[DEBUG Inverse] Inverse DataFrame columns before pass-through merge: {inverse_df.columns.tolist()}")
            logger.debug(f"[DEBUG Inverse] all_original_features: {list(all_original_features)}")
            logger.debug(f"[DEBUG Inverse] passthrough_columns: {passthrough_columns}")

            if passthrough_columns:
                logger.debug(f"[DEBUG Inverse] Passthrough columns to merge: {passthrough_columns}")
                passthrough_data = original_data[passthrough_columns].copy()
                inverse_df = pd.concat([inverse_df, passthrough_data], axis=1)

                # Ensure the final DataFrame has the same column order as original_data
                inverse_df = inverse_df[original_data.columns]
                logger.debug(f"[DEBUG Inverse] Final inverse DataFrame shape: {inverse_df.shape}")
                
                # Check for missing columns after inverse transform
                expected_columns = set(original_data.columns)
                final_columns = set(inverse_df.columns)
                missing_after_inverse = expected_columns - final_columns

                if missing_after_inverse:
                    err_msg = (
                    f"Inverse transform error: The following columns are missing "
                    f"after inverse transform: {missing_after_inverse}"
                    )
                    logger.error(err_msg)
                    raise ValueError(err_msg)
            else:
                logger.debug("[DEBUG Inverse] No passthrough columns to merge.")
        else:
            logger.debug("[DEBUG Inverse] Either no original_data provided or no transformations were applied.")

        return inverse_df



    def build_pipeline(self, X_train: pd.DataFrame) -> ColumnTransformer:
        transformers = []

        # Handle Numerical Features
        if self.numericals:
            numerical_strategy = self.options.get('handle_missing_values', {}).get('numerical_strategy', {}).get('strategy', 'median')
            numerical_imputer = self.options.get('handle_missing_values', {}).get('numerical_strategy', {}).get('imputer', 'SimpleImputer')

            if numerical_imputer == 'SimpleImputer':
                num_imputer = SimpleImputer(strategy=numerical_strategy)
            elif numerical_imputer == 'KNNImputer':
                knn_neighbors = self.options.get('handle_missing_values', {}).get('numerical_strategy', {}).get('knn_neighbors', 5)
                num_imputer = KNNImputer(n_neighbors=knn_neighbors)
            else:
                raise ValueError(f"Unsupported numerical imputer type: {numerical_imputer}")

            # Determine scaling method
            scaling_method = self.options.get('apply_scaling', {}).get('method', None)
            if scaling_method is None:
                # Default scaling based on model category
                if self.model_category in ['regression', 'classification', 'clustering']:
                    # For clustering, MinMaxScaler is generally preferred
                    if self.model_category == 'clustering':
                        scaler = MinMaxScaler()
                        scaling_type = 'MinMaxScaler'
                    else:
                        scaler = StandardScaler()
                        scaling_type = 'StandardScaler'
                else:
                    scaler = 'passthrough'
                    scaling_type = 'None'
            else:
                # Normalize the scaling_method string to handle case-insensitivity
                scaling_method_normalized = scaling_method.lower()
                if scaling_method_normalized == 'standardscaler':
                    scaler = StandardScaler()
                    scaling_type = 'StandardScaler'
                elif scaling_method_normalized == 'minmaxscaler':
                    scaler = MinMaxScaler()
                    scaling_type = 'MinMaxScaler'
                elif scaling_method_normalized == 'robustscaler':
                    scaler = RobustScaler()
                    scaling_type = 'RobustScaler'
                elif scaling_method_normalized == 'none':
                    scaler = 'passthrough'
                    scaling_type = 'None'
                else:
                    raise ValueError(f"Unsupported scaling method: {scaling_method}")

            numerical_transformer = Pipeline(steps=[
                ('imputer', num_imputer),
                ('scaler', scaler)
            ])

            transformers.append(('num', numerical_transformer, self.numericals))
            self.logger.debug(f"Numerical transformer added with imputer '{numerical_imputer}' and scaler '{scaling_type}'.")

        # Handle Ordinal Categorical Features
        if self.ordinal_categoricals:
            ordinal_strategy = self.options.get('encode_categoricals', {}).get('ordinal_encoding', 'OrdinalEncoder')
            if ordinal_strategy == 'OrdinalEncoder':
                ordinal_transformer = Pipeline(steps=[
                    ('imputer', SimpleImputer(strategy='most_frequent')),
                    ('ordinal_encoder', OrdinalEncoder())
                ])
                transformers.append(('ord', ordinal_transformer, self.ordinal_categoricals))
                self.logger.debug("Ordinal transformer added with OrdinalEncoder.")
            else:
                raise ValueError(f"Unsupported ordinal encoding strategy: {ordinal_strategy}")

        # Handle Nominal Categorical Features
        if self.nominal_categoricals:
            nominal_strategy = self.options.get('encode_categoricals', {}).get('nominal_encoding', 'OneHotEncoder')
            if nominal_strategy == 'OneHotEncoder':
                nominal_transformer = Pipeline(steps=[
                    ('imputer', SimpleImputer(strategy='most_frequent')),
                    # `sparse_output` replaced `sparse` in scikit-learn 1.2 (the old name was removed in 1.4).
                    ('onehot_encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
                ])
                transformers.append(('nominal', nominal_transformer, self.nominal_categoricals))
                self.logger.debug("Nominal transformer added with OneHotEncoder.")
            elif nominal_strategy == 'OrdinalEncoder':
                nominal_transformer = Pipeline(steps=[
                    ('imputer', SimpleImputer(strategy='most_frequent')),
                    ('ordinal_encoder', OrdinalEncoder())
                ])
                transformers.append(('nominal_ord', nominal_transformer, self.nominal_categoricals))
                self.logger.debug("Nominal transformer added with OrdinalEncoder.")
            elif nominal_strategy == 'FrequencyEncoder':
                # Implement custom Frequency Encoding
                for feature in self.nominal_categoricals:
                    freq = X_train[feature].value_counts(normalize=True)
                    X_train[feature] = X_train[feature].map(freq)
                    self.feature_reasons[feature] += 'Frequency Encoding applied | '
                    self.logger.debug(f"Frequency Encoding applied to '{feature}'.")
            else:
                raise ValueError(f"Unsupported nominal encoding strategy: {nominal_strategy}")

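        # Look the strategy up again here: `nominal_strategy` is unbound when there are no nominal features.
        frequency_encoded = bool(self.nominal_categoricals) and (
            self.options.get('encode_categoricals', {}).get('nominal_encoding', 'OneHotEncoder') == 'FrequencyEncoder'
        )
        if not transformers and not frequency_encoded: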
            self.logger.error("No transformers added to the pipeline. Check feature categorization and configuration.")
            raise ValueError("No transformers added to the pipeline. Check feature categorization and configuration.")

        preprocessor = ColumnTransformer(transformers=transformers, remainder='passthrough')
        self.logger.debug("ColumnTransformer constructed with the following transformers:")
        for t in transformers:
            self.logger.debug(t)

        preprocessor.fit(X_train)
        self.logger.info("✅ Preprocessor fitted on training data.")

        # Determine categorical feature indices for SMOTENC if needed
        if self.options.get('implement_smote', {}).get('variant', None) == 'SMOTENC':
            if not self.categorical_indices:
                categorical_features = []
                for name, transformer, features in preprocessor.transformers_:
                    if 'ord' in name or 'nominal' in name:
                        if isinstance(transformer, Pipeline):
                            encoder = transformer.named_steps.get('ordinal_encoder') or transformer.named_steps.get('onehot_encoder')
                            if hasattr(encoder, 'categories_'):
                                # Map this transformer's columns to their positions in the transformed
                                # output (ColumnTransformer.output_indices_ requires scikit-learn >= 1.0).
                                out_slice = preprocessor.output_indices_[name]
                                categorical_features.extend(range(out_slice.start, out_slice.stop))
                self.categorical_indices = categorical_features
                self.logger.debug(f"Categorical feature indices for SMOTENC: {self.categorical_indices}")

        return preprocessor

    # NEW: Phase-Aware Normalization (generic for any group such as 'phase', 'workout_type', etc.)
    def phase_scaling(self, df: pd.DataFrame, numeric_cols: List[str], group_column: str) -> Tuple[pd.DataFrame, Dict]:
        """
        Normalize numeric features within each group (e.g. each phase) using RobustScaler.
        Logs summary statistics before and after scaling.

        Args:
            df (pd.DataFrame): Input DataFrame.
            numeric_cols (List[str]): List of numeric columns to scale.
            group_column (str): The column used for grouping (e.g. 'phase').

        Returns:
            Tuple[pd.DataFrame, Dict]: The DataFrame with scaled values and a dictionary of fitted scalers per group.
        """
        from sklearn.preprocessing import RobustScaler

        scalers = {}
        groups = df[group_column].unique()
        self.logger.info(f"Starting phase-aware normalization on column '{group_column}' for groups: {groups}")
        for grp in groups:
            phase_mask = df[group_column] == grp
            df_grp = df.loc[phase_mask, numeric_cols]
            # Log before scaling
            self.logger.debug(f"Before scaling for group '{grp}':\n{df_grp.describe()}")
            scaler = RobustScaler().fit(df_grp)
            df.loc[phase_mask, numeric_cols] = scaler.transform(df_grp)
            scalers[grp] = scaler
            # Log after scaling
            self.logger.debug(f"After scaling for group '{grp}':\n{df.loc[phase_mask, numeric_cols].describe()}")
        return df, scalers

    # NEW: Adaptive Window Calculation based on group duration statistics.
    @staticmethod
    def calculate_phase_window(phase_data: pd.DataFrame, base_size: int = 100, std_dev: int = 2) -> int:
        """
        Estimate an optimal window size for a given phase (or group) based on its duration statistics.
        
        Args:
            phase_data (pd.DataFrame): Data for a specific phase/group.
            base_size (int): Minimum window size.
            std_dev (int): Multiplier for standard deviation.
        
        Returns:
            int: Calculated window size.
        """
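        # Assuming a grouping column exists (e.g., 'pitch_trial_id') to measure durations
        durations = phase_data.groupby('pitch_trial_id').size()
        if durations.empty:
            return base_size
        avg = durations.mean()
        std = durations.std()
        if np.isnan(std):  # A single trial has no sample standard deviation.
            std = 0.0
        # Clamp the result to [base_size, 300]; the upper bound of 300 rows is hard-coded.
        return int(np.clip(avg + std_dev * std, base_size, 300))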

    # NEW: Validation for target sequence alignment.
    @staticmethod
    def check_target_alignment(X_seq: Any, y_seq: Any, horizon: int) -> bool:
        """
        Check that each input sequence's last 'horizon' rows align with the target sequence length.
        
        Args:
            X_seq (Iterable): Iterable of input sequences.
            y_seq (Iterable): Iterable of target sequences.
            horizon (int): Expected target sequence length.
        
        Returns:
            bool: True if alignment is valid for all sequences; False otherwise.
        """
        for seq, target in zip(X_seq, y_seq):
            if seq[-horizon:].shape[0] != target.shape[0]:
                return False
        return True

    # NEW: Validation for phase (or group) transitions.
    @staticmethod
    def validate_phase_transitions(sequences: list, phase_column: str, valid_transitions: Dict[str, List[str]]) -> bool:
        """
        Validate that sequences contain biomechanically valid transitions between groups.
        
        Args:
            sequences (list): List of DataFrames or arrays that include a column for phases.
            phase_column (str): Name of the column that contains the group/phase information.
            valid_transitions (Dict[str, List[str]]): Dictionary mapping a phase to the list of allowed next phases.
        
        Returns:
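            bool: True if there are fewer than 0.01 invalid transitions per sequence on average
                (or no sequences at all), False otherwise.
        """
        if not sequences:
            return True  # Nothing to validate; also avoids dividing by zero below.
        errors = 0
        for seq in sequences:
            # For NumPy input, `phase_column` must be an integer column index rather than a name.
            phases = pd.Series(seq[:, phase_column]) if isinstance(seq, np.ndarray) else seq[phase_column]
            # Collapse consecutive repeats so that only actual phase changes are checked.
            # unique() would also drop later returns to an earlier phase and hide those transitions.
            phases = phases[phases != phases.shift()].to_numpy()
            for i in range(len(phases) - 1):
                current = phases[i]
                next_phase = phases[i+1]
                if next_phase not in valid_transitions.get(current, []):
                    errors += 1
        # For simplicity, we define a tolerance (fewer than 0.01 errors per sequence).
        return errors / len(sequences) < 0.01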

    # ---------------------------------------------------------------------
    # Updated preprocess_time_series: now includes an optional phase-aware normalization step.
    def preprocess_time_series(self, data: pd.DataFrame) -> Tuple[Any, None, Any, None, pd.DataFrame, None]:
        """
        Preprocess data specifically for time series models.
        
        Steps:
          1. Handle missing values and outliers.
          2. Sort the data by the time column.
          3. Optionally perform phase-aware normalization if enabled.
          4. Extract features and target.
          5. Build and fit the preprocessing pipeline.
          6. Transform the features.
          7. Create sequences:
             - If DTW is enabled and a grouping variable is provided, use create_sequences_by_category.
             - Otherwise, use sliding window segmentation.
          8. If grouping was used, apply hierarchical temporal encoding.
          9. (Optionally) Validate target alignment.
         10. Generate recommendations and save transformers.
        
        Returns:
            Tuple containing:
              - X_seq: Sequence array (or list) for time series inputs.
              - None for X_test.
              - y_seq: Sequence array for targets.
              - None for y_test.
              - recommendations: Preprocessing recommendations DataFrame.
              - None for inverse-transformed test data.
        """
        # Step 1: Handle missing values on the full DataFrame (features and target).
        data_clean, _ = self.handle_missing_values(data)
    
        # Step 1 (continued): Handle outliers.
        X_temp = data_clean.drop(columns=self.y_variable)
        y_temp = data_clean[self.y_variable]
        X_temp, y_temp = self.handle_outliers(X_temp, y_temp)
        data_clean = pd.concat([X_temp, y_temp], axis=1)
    
        # Step 2: Sort the data by the time column.
        if self.time_column is None:
            raise ValueError("For time series models, 'time_column' must be specified.")
        data_clean['__time__'] = pd.to_datetime(data_clean[self.time_column])
        data_sorted = data_clean.sort_values(by='__time__').drop(columns=['__time__'])
        assert all(col in data_sorted.columns for col in self.y_variable), "Target variable(s) missing after sorting!"
        self.logger.debug(f"Columns after sorting: {data_sorted.columns.tolist()}")

        # Step 3 (optional): Phase-aware normalization, if enabled in options.
        phase_norm_opts = self.options.get('phase_aware_normalization', {})
        if phase_norm_opts.get('enabled', False):
            group_col = phase_norm_opts.get('group_column', 'phase')
            num_cols = phase_norm_opts.get('numeric_columns', self.numericals)
            self.logger.info(f"Phase-aware normalization enabled on group '{group_col}'.")
            # Log summary statistics before normalization
            self.logger.debug(f"Before phase scaling (for columns {num_cols}):\n{data_sorted.groupby(group_col)[num_cols].describe()}")
            data_sorted, phase_scalers = self.phase_scaling(data_sorted, num_cols, group_col)
            # Log summary statistics after normalization
            self.logger.debug(f"After phase scaling (for columns {num_cols}):\n{data_sorted.groupby(group_col)[num_cols].describe()}")
    
        # Step 4: Extract features (X_clean) and target (y_clean) after sorting.
        X_clean = data_sorted.drop(columns=self.y_variable)
        y_clean = data_sorted[self.y_variable]
    
        # Step 5: Build and fit the preprocessing pipeline on the features.
        self.pipeline = self.build_pipeline(X_clean)
        X_preprocessed = self.pipeline.fit_transform(X_clean)
    
        # --- MODIFIED SEGMENTATION STEP ---
        if self.use_dtw:
            if self.sequence_categorical is not None:
                # When multiple grouping variables are provided, extract them from the sorted DataFrame.
                if isinstance(self.sequence_categorical, list) and len(self.sequence_categorical) > 1:
                    group_ids = data_sorted[self.sequence_categorical].values  # 2D array
                else:
                    group_ids = data_sorted[self.sequence_categorical[0]].values  # 1D array
                self.logger.info(f"DTW is enabled; using grouping-based segmentation with grouping keys: {self.sequence_categorical}.")
                X_seq, y_seq, group_keys = self.create_sequences_by_category(X_preprocessed, y_clean.values, group_ids)
                # Apply hierarchical temporal encoding to add one-hot and positional features.
                X_seq = self.temporal_encode_sequences(X_seq, group_keys)
            else:
                self.logger.warning("DTW is enabled but no grouping variable provided. Treating entire session as one group.")
                group_ids = np.ones(len(data_sorted), dtype=int)
                X_seq, y_seq, _ = self.create_sequences_by_category(X_preprocessed, y_clean.values, group_ids)
        else:
            # Sliding window segmentation (no grouping-based temporal encoding).
            X_seq, y_seq = self.create_sequences(X_preprocessed, y_clean.values)
        # --- END MODIFIED SEGMENTATION STEP ---
    
        # NEW: Validate that each sequence's target length matches expectations.
        if not self.check_target_alignment(X_seq, y_seq, self.horizon):
            self.logger.warning("⚠️ Target alignment check failed: Some sequences may not have matching target lengths.")
        else:
            self.logger.debug("Target alignment check passed for all sequences.")
    
        # Step 10: Generate recommendations and save transformers.
        recommendations = self.generate_recommendations()
        self.final_feature_order = list(self.pipeline.get_feature_names_out())
        self.save_transformers()
    
        return X_seq, None, y_seq, None, recommendations, None





    def preprocess_train(self, X: pd.DataFrame, y: pd.Series) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series, pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Preprocess training data for various model types.
        For time series models, delegate to preprocess_time_series.
        
        Returns:
            - For standard models: X_train_final, X_test_final, y_train_smoted, y_test, recommendations, X_test_inverse.
            - For time series models: X_seq, None, y_seq, None, recommendations, None.
        """
        # If the model is time series, use the dedicated time series preprocessing flow.
        if self.model_category == 'time_series':
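            # preprocess_time_series takes a single DataFrame that still contains the target column(s).
            return self.preprocess_time_series(pd.concat([X, y], axis=1))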
        
        # Standard preprocessing flow for classification/regression/clustering
        X_train_original, X_test_original, y_train_original, y_test = self.split_dataset(X, y)
        X_train_missing_values, X_test_missing_values = self.handle_missing_values(X_train_original, X_test_original)
        
        # Only perform normality tests if applicable
        if self.model_category in ['regression', 'classification', 'clustering']:
            self.test_normality(X_train_missing_values)
        
        X_train_outliers_handled, y_train_outliers_handled = self.handle_outliers(X_train_missing_values, y_train_original)
        X_test_outliers_handled = X_test_missing_values.copy() if X_test_missing_values is not None else None
        recommendations = self.generate_recommendations()
        self.pipeline = self.build_pipeline(X_train_outliers_handled)
        X_train_preprocessed = self.pipeline.fit_transform(X_train_outliers_handled)
        X_test_preprocessed = self.pipeline.transform(X_test_outliers_handled) if X_test_outliers_handled is not None else None

        if self.model_category == 'classification':
            try:
                X_train_smoted, y_train_smoted = self.implement_smote(X_train_preprocessed, y_train_outliers_handled)
            except Exception as e:
                self.logger.error(f"❌ SMOTE application failed: {e}")
                raise
        else:
            X_train_smoted, y_train_smoted = X_train_preprocessed, y_train_outliers_handled
            self.logger.info("⚠️ SMOTE not applied: Not a classification model.")

        self.final_feature_order = list(self.pipeline.get_feature_names_out())
        X_train_final = pd.DataFrame(X_train_smoted, columns=self.final_feature_order)
        X_test_final = pd.DataFrame(X_test_preprocessed, columns=self.final_feature_order, index=X_test_original.index) if X_test_preprocessed is not None else None

        try:
            self.save_transformers()
        except Exception as e:
            self.logger.error(f"❌ Saving transformers failed: {e}")
            raise

        try:
            if X_test_final is not None:
                X_test_inverse = self.inverse_transform_data(X_test_final.values, original_data=X_test_original)
                self.logger.info("✅ Inverse transformations applied successfully.")
            else:
                X_test_inverse = None
        except Exception as e:
            self.logger.error(f"❌ Inverse transformations failed: {e}")
            X_test_inverse = None

        return X_train_final, X_test_final, y_train_smoted, y_test, recommendations, X_test_inverse


    def transform(self, X: pd.DataFrame) -> np.ndarray:
        """
        Transform new data using the fitted preprocessing pipeline.

        Args:
            X (pd.DataFrame): New data to transform.

        Returns:
            np.ndarray: Preprocessed data.
        """
        if self.pipeline is None:
            self.logger.error("Preprocessing pipeline has not been fitted. Cannot transform data.")
            raise AttributeError("Preprocessing pipeline has not been fitted. Cannot transform data.")
        self.logger.debug("Transforming new data.")
        X_preprocessed = self.pipeline.transform(X)
        if self.debug:
            self.logger.debug(f"Transformed data shape: {X_preprocessed.shape}")
        else:
            self.logger.info("Data transformed.")
        return X_preprocessed

    def preprocess_predict(self, X: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Preprocess new data for prediction.

        Args:
            X (pd.DataFrame): New data for prediction.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame, Optional[pd.DataFrame]]: X_preprocessed, recommendations, X_inversed
        """
        step_name = "Preprocess Predict"
        self.logger.info(f"Step: {step_name}")

        # Log initial columns and feature count
        self.logger.debug(f"Initial columns in prediction data: {X.columns.tolist()}")
        self.logger.debug(f"Initial number of features: {X.shape[1]}")

        # Load transformers
        try:
            transformers = self.load_transformers()
            self.logger.debug("Transformers loaded successfully.")
        except Exception as e:
            self.logger.error(f"❌ Failed to load transformers: {e}")
            raise

        # Filter columns based on raw feature names
        try:
            X_filtered = self.filter_columns(X)
            self.logger.debug(f"Columns after filtering: {X_filtered.columns.tolist()}")
            self.logger.debug(f"Number of features after filtering: {X_filtered.shape[1]}")
        except Exception as e:
            self.logger.error(f"❌ Failed during column filtering: {e}")
            raise

        # Handle missing values
        try:
            X_filtered, _ = self.handle_missing_values(X_filtered)
            self.logger.debug(f"Columns after handling missing values: {X_filtered.columns.tolist()}")
            self.logger.debug(f"Number of features after handling missing values: {X_filtered.shape[1]}")
        except Exception as e:
            self.logger.error(f"❌ Failed during missing value handling: {e}")
            raise

        # Ensure all expected raw features are present
        expected_raw_features = self.numericals + self.ordinal_categoricals + self.nominal_categoricals
        provided_features = X_filtered.columns.tolist()

        self.logger.debug(f"Expected raw features: {expected_raw_features}")
        self.logger.debug(f"Provided features: {provided_features}")

        missing_raw_features = set(expected_raw_features) - set(provided_features)
        if missing_raw_features:
            self.logger.error(f"❌ Missing required raw feature columns in prediction data: {missing_raw_features}")
            raise ValueError(f"Missing required raw feature columns in prediction data: {missing_raw_features}")

        # Handle unexpected columns (optional: ignore or log)
        unexpected_features = set(provided_features) - set(expected_raw_features)
        if unexpected_features:
            self.logger.warning(f"⚠️ Unexpected columns in prediction data that will be ignored: {unexpected_features}")

        # Ensure the order of columns matches the pipeline's expectation (optional)
        X_filtered = X_filtered[expected_raw_features]
        self.logger.debug("Reordered columns to match the pipeline's raw feature expectations.")

        # Transform data using the loaded pipeline
        try:
            X_preprocessed_np = self.pipeline.transform(X_filtered)
            self.logger.debug(f"Transformed data shape: {X_preprocessed_np.shape}")
        except Exception as e:
            self.logger.error(f"❌ Transformation failed: {e}")
            raise

        # Retrieve feature names from the pipeline or use stored final_feature_order
        if hasattr(self.pipeline, 'get_feature_names_out'):
            try:
                columns = self.pipeline.get_feature_names_out()
                self.logger.debug(f"Derived feature names from pipeline: {columns.tolist()}")
            except Exception as e:
                self.logger.warning(f"Could not retrieve feature names from pipeline: {e}")
                columns = self.final_feature_order
                self.logger.debug(f"Using stored final_feature_order for column names: {columns}")
        else:
            columns = self.final_feature_order
            self.logger.debug(f"Using stored final_feature_order for column names: {columns}")

        # Convert NumPy array back to DataFrame with correct column names
        try:
            X_preprocessed_df = pd.DataFrame(X_preprocessed_np, columns=columns, index=X_filtered.index)
            self.logger.debug(f"X_preprocessed_df columns: {X_preprocessed_df.columns.tolist()}")
            self.logger.debug(f"Sample of X_preprocessed_df:\n{X_preprocessed_df.head()}")
        except Exception as e:
            self.logger.error(f"❌ Failed to convert transformed data to DataFrame: {e}")
            raise

        # Inverse transform so results can be inspected in the original feature space (optional).
        try:
            self.logger.debug(f"[DEBUG] Original data shape before inverse transform: {X.shape}")
            X_inversed = self.inverse_transform_data(X_preprocessed_np, original_data=X)
            self.logger.debug(f"[DEBUG] Inversed data shape: {X_inversed.shape}")
        except Exception as e:
            self.logger.error(f"❌ Inverse transformation failed: {e}")
            X_inversed = None

        # Generate recommendations (if applicable)
        try:
            recommendations = self.generate_recommendations()
            self.logger.debug("Generated preprocessing recommendations.")
        except Exception as e:
            self.logger.error(f"❌ Failed to generate recommendations: {e}")
            recommendations = pd.DataFrame()

        # Prepare outputs
        return X_preprocessed_df, recommendations, X_inversed

    def preprocess_clustering(self, X: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """
        Preprocess data for clustering mode.

        Args:
            X (pd.DataFrame): Input features for clustering.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame]: X_processed, recommendations.
        """
        step_name = "Preprocess Clustering"
        self.logger.info(f"Step: {step_name}")
        debug_flag = self.get_debug_flag('debug_handle_missing_values')  # Use relevant debug flags

        # Handle Missing Values
        X_missing, _ = self.handle_missing_values(X, None)
        self.logger.debug(f"After handling missing values: X_missing.shape={X_missing.shape}")

        # Handle Outliers
        X_outliers_handled, _ = self.handle_outliers(X_missing, None)
        self.logger.debug(f"After handling outliers: X_outliers_handled.shape={X_outliers_handled.shape}")

        # Test Normality (optional for clustering)
        if self.model_category in ['clustering']:
            self.logger.info("Skipping normality tests for clustering.")
        else:
            self.test_normality(X_outliers_handled)

        # Generate Preprocessing Recommendations
        recommendations = self.generate_recommendations()

        # Build and Fit the Pipeline
        self.pipeline = self.build_pipeline(X_outliers_handled)
        self.logger.debug("Pipeline built and fitted.")

        # Transform the data
        X_processed = self.pipeline.transform(X_outliers_handled)
        self.logger.debug(f"After pipeline transform: X_processed.shape={X_processed.shape}")

        # Optionally, inverse transformations can be handled if necessary

        # Save Transformers (if needed)
        # Not strictly necessary for clustering unless you plan to apply the same preprocessing on new data
        self.save_transformers()

        self.logger.info("✅ Clustering data preprocessed successfully.")

        return X_processed, recommendations

    def final_preprocessing(self, data: pd.DataFrame) -> Tuple:
        """
        Execute the full preprocessing pipeline based on the mode.

        For 'train' mode:
        - If time series: pass the full filtered DataFrame (which includes the target) 
            to preprocess_time_series.
        - Else: split the data into X and y, then call preprocess_train.
        For 'predict' and 'clustering' modes, the existing flow remains unchanged.

        Returns:
            Tuple: Depending on mode:
                - 'train': For standard models: X_train, X_test, y_train, y_test, recommendations, X_test_inverse.
                            For time series models: X_seq, None, y_seq, None, recommendations, None.
                - 'predict': X_preprocessed, recommendations, X_inverse.
                - 'clustering': X_processed, recommendations.
        """
        self.logger.info(f"Starting: Final Preprocessing Pipeline in '{self.mode}' mode.")
        
        try:
            data = self.filter_columns(data)
            self.logger.info("✅ Column filtering completed successfully.")
        except Exception as e:
            self.logger.error(f"❌ Column filtering failed: {e}")
            raise

        if self.mode == 'train':
            if self.model_category == 'time_series':
                # For time series mode, do not split the DataFrame.
                # Pass the full filtered data (which still contains the target variable)
                # so that the time series preprocessing flow can extract the target after cleaning and sorting.
                return self.preprocess_time_series(data)
            else:
                if not all(col in data.columns for col in self.y_variable):
                    missing_y = [col for col in self.y_variable if col not in data.columns]
                    raise ValueError(f"Target variable(s) {missing_y} not found in the dataset.")
                X = data.drop(self.y_variable, axis=1)
                y = data[self.y_variable].iloc[:, 0] if len(self.y_variable) == 1 else data[self.y_variable]
                return self.preprocess_train(X, y)
        
        elif self.mode == 'predict':
            X = data.copy()
            transformers_path = os.path.join(self.transformers_dir, 'transformers.pkl')
            if not os.path.exists(transformers_path):
                self.logger.error(f"❌ Transformers file not found at '{transformers_path}'. Cannot proceed with prediction.")
                raise FileNotFoundError(f"Transformers file not found at '{transformers_path}'.")
            X_preprocessed, recommendations, X_inversed = self.preprocess_predict(X)
            self.logger.info("✅ Preprocessing completed successfully in predict mode.")
            return X_preprocessed, recommendations, X_inversed
        
        elif self.mode == 'clustering':
            X = data.copy()
            return self.preprocess_clustering(X)
        
        else:
            raise NotImplementedError(f"Mode '{self.mode}' is not implemented.")



    # Optionally, implement a method to display column info for debugging
    def _debug_column_info(self, df: pd.DataFrame, step: str = "Debug Column Info"):
        """
        Display information about DataFrame columns for debugging purposes.

        Args:
            df (pd.DataFrame): The DataFrame to inspect.
            step (str, optional): Description of the current step. Defaults to "Debug Column Info".
        """
        self.logger.debug(f"\n📊 {step}: Column Information")
        for col in df.columns:
            self.logger.debug(f"Column '{col}': {df[col].dtype}, Unique Values: {df[col].nunique()}")
        self.logger.debug("\n")
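

# --- Illustrative sketch (not part of the original class) --------------------
# phase_scaling() returns one fitted scaler per group. To scale new data the
# same way, those scalers could be reused as sketched below. The function name
# and the choice to leave unseen groups unscaled are assumptions made for this
# example, not behavior defined elsewhere in this module.
import pandas as pd


def apply_group_scalers(df, numeric_cols, group_column, scalers):
    """Apply already-fitted per-group scalers to a copy of `df`."""
    out = df.copy()
    # Cast first so that scaled (float) values can be written into integer columns.
    out[numeric_cols] = out[numeric_cols].astype(float)
    for grp, scaler in scalers.items():
        mask = out[group_column] == grp
        if mask.any():
            out.loc[mask, numeric_cols] = scaler.transform(out.loc[mask, numeric_cols])
    # Rows whose group has no fitted scaler are returned unscaled.
    return out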



#---------------------------------
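
# --- Illustrative sketch (not part of the original module) -------------------
# preprocess_time_series() falls back to sliding-window segmentation when DTW
# is disabled. The helper below shows one common way to cut such windows. It is
# a hypothetical example: the name, the `step` parameter, and the convention
# that the target is the `horizon` rows following each window are assumptions,
# and the class's own create_sequences() may behave differently.
import numpy as np


def sliding_windows(X, y, window_size, horizon=1, step=1):
    """Return (X_seq, y_seq): overlapping input windows and the rows that follow each one."""
    X_seq, y_seq = [], []
    last_start = len(X) - window_size - horizon
    for start in range(0, last_start + 1, step):
        end = start + window_size
        X_seq.append(X[start:end])          # Input window.
        y_seq.append(y[end:end + horizon])  # Targets immediately after the window.
    return np.array(X_seq), np.array(y_seq)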


import numpy as np

def dtw_path(s1: np.ndarray, s2: np.ndarray) -> list:
    """
    Compute the DTW cost matrix and return the optimal warping path.
    
    Args:
        s1: Sequence 1, shape (n, features)
        s2: Sequence 2, shape (m, features)
    
    Returns:
        path: A list of index pairs [(i, j), ...] indicating the alignment.
    """
    n, m = len(s1), len(s2)
    cost = np.full((n+1, m+1), np.inf)
    cost[0, 0] = 0

    # Build the cost matrix
    for i in range(1, n+1):
        for j in range(1, m+1):
            dist = np.linalg.norm(s1[i-1] - s2[j-1])
            cost[i, j] = dist + min(cost[i-1, j], cost[i, j-1], cost[i-1, j-1])

    # Backtracking to find the optimal path
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i-1, j-1))
        directions = [cost[i-1, j], cost[i, j-1], cost[i-1, j-1]]
        min_index = np.argmin(directions)
        if min_index == 0:
            i -= 1
        elif min_index == 1:
            j -= 1
        else:
            i -= 1
            j -= 1
    path.reverse()
    return path

def warp_sequence(seq: np.ndarray, path: list, target_length: int) -> np.ndarray:
    """
    Warp the given sequence to match the target length based on the DTW warping path.
    
    Args:
        seq: Original sequence, shape (n, features)
        path: Warping path from dtw_path (list of tuples)
        target_length: Desired sequence length (typically the reference length)
    
    Returns:
        aligned_seq: Warped sequence with shape (target_length, features)
    """
    aligned_seq = np.zeros((target_length, seq.shape[1]))
    # Create mapping: for each target index, collect corresponding indices from seq
    mapping = {t: [] for t in range(target_length)}
    for (i, j) in path:
        mapping[j].append(i)
    
    for t in range(target_length):
        indices = mapping[t]
        if indices:
            aligned_seq[t] = np.mean(seq[indices], axis=0)
        else:
            # If no alignment, reuse the previous value (or use interpolation)
            aligned_seq[t] = aligned_seq[t-1] if t > 0 else seq[0]
    return aligned_seq
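
The two functions above can be exercised on toy data. The following is a minimal sketch, not the project's pipeline: it uses a condensed copy of the DTW routine and a simplified warping helper (`warp_to_reference`, a hypothetical name) to stretch a 4-frame ramp onto a 6-frame reference. The input values are invented for illustration.

```python
import numpy as np

def dtw_path(s1, s2):
    """Condensed copy of the DTW routine above (Euclidean frame distance)."""
    n, m = len(s1), len(s2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = np.linalg.norm(s1[i - 1] - s2[j - 1])
            cost[i, j] = dist + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the bottom-right corner to the origin.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]]))
        if step == 0:
            i -= 1
        elif step == 1:
            j -= 1
        else:
            i -= 1
            j -= 1
    return path[::-1]

def warp_to_reference(seq, path, target_length):
    """Average the frames of `seq` that the path assigns to each reference index."""
    warped = np.zeros((target_length, seq.shape[1]))
    for t in range(target_length):
        idx = [i for i, j in path if j == t]
        warped[t] = seq[idx].mean(axis=0)
    return warped

# Toy data (invented): a 4-frame ramp aligned to a 6-frame reference ramp, 1 feature each.
short = np.linspace(0.0, 1.0, 4).reshape(-1, 1)
reference = np.linspace(0.0, 1.0, 6).reshape(-1, 1)

path = dtw_path(short, reference)
warped = warp_to_reference(short, path, target_length=len(reference))
print(warped.shape)  # (6, 1)
```

Because a DTW path visits every reference index at least once, each target position receives at least one source frame when `target_length` equals the reference length.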

# scripts/model_factory.py

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, SVR
from xgboost import XGBClassifier, XGBRegressor
from sklearn.cluster import KMeans
from pathlib import Path
import joblib
import logging
# from datapreprocessor import DataPreprocessor # Importing the DataPreprocessor class from datapreprocessor.py

logger = logging.getLogger(__name__)

def get_model(model_type: str, model_sub_type: str):
    """Factory function to get model instances based on the model type and subtype."""
    if model_type == "Tree Based Classifier":
        if model_sub_type == "Random Forest":
            return RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
        elif model_sub_type == "XGBoost":
            return XGBClassifier(eval_metric='logloss', random_state=42)
        elif model_sub_type == "Decision Tree":
            return DecisionTreeClassifier(random_state=42)
        else:
            raise ValueError(f"Classifier subtype '{model_sub_type}' is not supported under '{model_type}'.")
    
    elif model_type == "Logistic Regression":
        if model_sub_type == "Logistic Regression":
            return LogisticRegression(random_state=42, max_iter=1000)
        else:
            raise ValueError(f"Subtype '{model_sub_type}' is not supported under '{model_type}'.")
    
    elif model_type == "K-Means":
        if model_sub_type == "K-Means":
            return KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, random_state=42)
        else:
            raise ValueError(f"Clustering subtype '{model_sub_type}' is not supported under '{model_type}'.")
    
    elif model_type == "Linear Regression":
        if model_sub_type == "Linear Regression":
            return LinearRegression()
        else:
            raise ValueError(f"Subtype '{model_sub_type}' is not supported under '{model_type}'.")
    
    elif model_type == "Tree Based Regressor":
        if model_sub_type == "Random Forest Regressor":
            return RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
        elif model_sub_type == "XGBoost Regressor":
            return XGBRegressor(eval_metric='rmse', random_state=42)
        elif model_sub_type == "Decision Tree Regressor":
            return DecisionTreeRegressor(random_state=42)
        else:
            raise ValueError(f"Regressor subtype '{model_sub_type}' is not supported under '{model_type}'.")
    
    elif model_type == "Support Vector Machine":
        if model_sub_type == "Support Vector Machine":
            return SVC(probability=True, random_state=42)
        else:
            raise ValueError(f"SVM subtype '{model_sub_type}' is not supported under '{model_type}'.")
    
    else:
        raise ValueError(f"Model type '{model_type}' is not supported.")

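import pandas as pd  # Needed before the def: the pd.Series annotation is evaluated at definition time

def estimate_optimal_window_size(time_series: pd.Series, threshold: float = 0.5, max_window: int = 100) -> int: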
    """
    Estimate an optimal window size based on when the autocorrelation drops below a threshold.
    
    Args:
        time_series: A pandas Series representing the raw time series data.
        threshold: Autocorrelation threshold (default 0.5).
        max_window: Maximum window length to consider.
    
    Returns:
        Optimal window size as an integer.
    """
    for lag in range(1, max_window + 1):
        ac = time_series.autocorr(lag=lag)
        if ac < threshold:
            return lag
    return max_window
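
To see how the autocorrelation heuristic behaves, here is a small sketch on a synthetic signal. The sine wave and the helper name `first_lag_below` are assumptions for illustration and are not part of the project data or API. For a sine with a 60-sample period the autocorrelation is roughly cos(2π·lag/60), so it should fall below 0.5 near lag 10.

```python
import numpy as np
import pandas as pd

def first_lag_below(series: pd.Series, threshold: float = 0.5, max_window: int = 100) -> int:
    """Return the first lag whose autocorrelation is below `threshold`."""
    for lag in range(1, max_window + 1):
        if series.autocorr(lag=lag) < threshold:
            return lag
    return max_window

# Synthetic signal (an assumption, not project data): sine wave with a 60-sample period.
t = np.arange(600)
signal = pd.Series(np.sin(2 * np.pi * t / 60))

window = first_lag_below(signal)
print("Estimated window size:", window)
```

On a strongly periodic signal the result tracks a fraction of the period. On a slowly varying or nearly constant target the threshold may never be crossed, in which case the function falls back to `max_window`.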

import pandas as pd
import numpy as np
import yaml
from pathlib import Path


def load_config(config_path: Path) -> dict:
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    return config





def main_time_series():
    # Load configuration from the updated YAML file
    config_path = Path("../../dataset/test/preprocessor_config/preprocessor_config_baseball.yaml")
    config = load_config(config_path)
    
    # Extract time series parameters from the config
    ts_params = config.get("time_series", {})
    if not ts_params.get("enabled", False):
        print("[INFO] Time series processing is not enabled in the config.")
        return

    # Set dataset path based on config
    data_dir = Path(config["paths"]["data_dir"])
    raw_data_file = config["paths"]["raw_data"]
    raw_data_path = data_dir / raw_data_file
    
    # Load dataset
    try:
        df = pd.read_parquet(raw_data_path)
        print(f"[INFO] Dataset loaded from {raw_data_path}. Shape: {df.shape}")
    except Exception as e:
        print(f"[ERROR] Failed to load dataset: {e}")
        return

    # Estimate an optimal window size from the target column (optional)
    target_col = config["features"]["y_variable"][0]
    optimal_window = estimate_optimal_window_size(df[target_col], threshold=0.5, max_window=100)
    print("[INFO] Estimated optimal window size:", optimal_window)
    
    # -------------------------------
    # Example 1: Trial Segmentation (Per-Trial) using Padding (DTW disabled)
    # -------------------------------
    ts_params_trial = ts_params.copy()
    ts_params_trial["use_dtw"] = False  # Disable DTW for simple padding
    print("\n[INFO] Running Trial Segmentation Example (per-trial segmentation with DTW disabled)...")
    
    # preprocessor_trial = DataPreprocessor(
    #     model_type="LSTM",  # triggers the time_series branch
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"].get("ordinal_categoricals", []),
    #     nominal_categoricals=config["features"].get("nominal_categoricals", []),
    #     numericals=[col for col in config["features"]["numericals"] if col != target_col],  # exclude target and time column
    #     mode="train",
    #     options=ts_params_trial,
    #     debug=True,
    #     graphs_output_dir=str(Path(config["paths"]["plots_output_dir"])),
    #     transformers_dir=str(Path(config["paths"]["transformers_save_base_dir"])),
    #     time_column=ts_params_trial.get("time_column", "frame_time"),
    #     window_size=ts_params_trial.get("window_size", optimal_window),
    #     horizon=ts_params_trial.get("horizon", 1),
    #     step_size=ts_params_trial.get("step_size", 1),
    #     max_sequence_length=ts_params_trial.get("max_sequence_length", optimal_window),
    #     use_dtw=ts_params_trial["use_dtw"],
    #     sequence_categorical=["trial_id"]  # Grouping by trial identifier
    # )
    
    # try:
    #     X_seq_trial, _, y_seq_trial, _, rec_trial, _ = preprocessor_trial.final_preprocessing(df)
    #     print("[INFO] Trial Segmentation Example complete.")
    #     print(f"[INFO] X_seq (trial segmentation) shape: {X_seq_trial.shape}")
    #     print(f"[INFO] y_seq (trial segmentation) shape: {y_seq_trial.shape}")
    #     print("[INFO] Preprocessing recommendations (trial segmentation):")
    #     print(rec_trial)
    # except Exception as e:
    #     print(f"[ERROR] Trial Segmentation Example failed: {e}")
    
    # -------------------------------
    # Example 2: Whole-Session Segmentation using Sliding Window (DTW enabled)
    # -------------------------------
    ts_params_session = ts_params.copy()
    ts_params_session["use_dtw"] = True  # Enable DTW alignment
    print("\n[INFO] Running Whole-Session Segmentation Example (sliding window with DTW enabled)...")
    
    # preprocessor_session = DataPreprocessor(
    #     model_type="LSTM",
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"].get("ordinal_categoricals", []),
    #     nominal_categoricals=config["features"].get("nominal_categoricals", []),
    #     numericals=[col for col in config["features"]["numericals"] if col != target_col],
    #     mode="train",
    #     options=ts_params_session,
    #     debug=True,
    #     graphs_output_dir=str(Path(config["paths"]["plots_output_dir"])),
    #     transformers_dir=str(Path(config["paths"]["transformers_save_base_dir"])),
    #     time_column=ts_params_session.get("time_column", "frame_time"),
    #     window_size=ts_params_session.get("window_size", optimal_window),
    #     horizon=ts_params_session.get("horizon", 1),
    #     step_size=ts_params_session.get("step_size", 1),
    #     max_sequence_length=ts_params_session.get("max_sequence_length", optimal_window),
    #     use_dtw=ts_params_session["use_dtw"],
    #     sequence_categorical=None  # Use sliding window segmentation
    # )
    
    # try:
    #     X_seq_session, _, y_seq_session, _, rec_session, _ = preprocessor_session.final_preprocessing(df)
    #     print("[INFO] Whole-Session Segmentation Example complete.")
    #     print(f"[INFO] X_seq (session segmentation) shape: {X_seq_session.shape}")
    #     print(f"[INFO] y_seq (session segmentation) shape: {y_seq_session.shape}")
    #     print("[INFO] Preprocessing recommendations (session segmentation):")
    #     print(rec_session)
    # except Exception as e:
    #     print(f"[ERROR] Whole-Session Segmentation Example failed: {e}")
    
    # -------------------------------
    # Example 3: Shooting Motion Segmentation with DTW Enabled
    # -------------------------------
    # In this example, the columns passed as sequence_categorical below act as the grouping variables.
    # This tests the pipeline's ability to group sequences by those columns
    # while DTW alignment is enabled.
    ts_params_shooting = ts_params.copy()
    ts_params_shooting["use_dtw"] = True  # Enable DTW alignment
    print("\n[INFO] Running Shooting Motion Segmentation Example (grouping by shooting_motion with DTW enabled)...")
    
    preprocessor_shooting = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"].get("ordinal_categoricals", []),
        nominal_categoricals=config["features"].get("nominal_categoricals", []),
        numericals=[col for col in config["features"]["numericals"] if col != target_col],  # exclude the target column
        mode="train",
        options=ts_params_shooting,
        debug=True,
        graphs_output_dir=str(Path(config["paths"]["plots_output_dir"])),
        transformers_dir=str(Path(config["paths"]["transformers_save_base_dir"])),
        time_column=ts_params_shooting.get("time_column", "ongoing_timestamp_biomech"),
        window_size=ts_params_shooting.get("window_size", optimal_window),
        horizon=ts_params_shooting.get("horizon", 1),
        step_size=ts_params_shooting.get("step_size", 1),
        max_sequence_length=ts_params_shooting.get("max_sequence_length", optimal_window),
        use_dtw=ts_params_shooting["use_dtw"],  # True (set above), so sequences are DTW-aligned
        sequence_categorical=["session_biomech", "pitch_phase_biomech"],  # Group by session and pitch phase
        dynamic_window_adjustment=False  # False: do not preserve natural sequence lengths
    )
    try:
        X_seq_shooting, _, y_seq_shooting, _, rec_shooting, _ = preprocessor_shooting.final_preprocessing(df)
        print("[INFO] Shooting Motion Segmentation Example complete.")
        print(f"[INFO] X_seq (shooting motion segmentation) shape: {X_seq_shooting.shape}")
        print(f"[INFO] y_seq (shooting motion segmentation) shape: {y_seq_shooting.shape}")
        print("[INFO] Preprocessing recommendations (shooting motion segmentation):")
        print(rec_shooting)
    except Exception as e:
        print(f"[ERROR] Shooting Motion Segmentation Example failed: {e}")

if __name__ == "__main__":
    print("Running updated time series preprocessing main function...")
    main_time_series()



Running updated time series preprocessing main function...


2025-02-20 18:12:04,425 [INFO] Starting: Final Preprocessing Pipeline in 'train' mode.
2025-02-20 18:12:04,425 [INFO] Step: filter_columns
2025-02-20 18:12:04,426 [DEBUG] y_variable provided: ['result']
2025-02-20 18:12:04,428 [DEBUG] Unique values in target column(s): {'result': {0: 0, 117: 1}}
2025-02-20 18:12:04,429 [INFO] ✅ Filtered DataFrame to include only specified features. Shape: (16047, 25)
2025-02-20 18:12:04,430 [DEBUG] Selected Features: ['landing_x', 'landing_y', 'entry_angle', 'ball_x', 'ball_y', 'ball_z', 'R_EYE_x', 'R_EYE_y', 'R_EYE_z', 'L_EYE_x', 'L_EYE_y', 'L_EYE_z', 'NOSE_x', 'NOSE_y', 'NOSE_z', 'R_SHOULDER_x', 'R_SHOULDER_y', 'R_SHOULDER_z', 'L_SHOULDER_x', 'L_SHOULDER_y', 'L_SHOULDER_z', 'shooting_motion', 'trial_id', 'frame_time']
2025-02-20 18:12:04,430 [DEBUG] Retained Target Variable(s): ['result']
2025-02-20 18:12:04,431 [INFO] ✅ Column filtering completed successfully.
2025-02-20 18:12:04,431 [INFO] Step: Handle Missing Values
2025-02-20 18:12:04,461 [INFO] 

[INFO] Dataset loaded from ..\..\dataset\test\data\final_granular_dataset.csv. Shape: (16047, 214)
[INFO] Estimated optimal window size: 68

[INFO] Running Trial Segmentation Example (per-trial segmentation with DTW disabled)...

[INFO] Running Whole-Session Segmentation Example (sliding window with DTW enabled)...

[INFO] Running Shooting Motion Segmentation Example (grouping by shooting_motion with DTW enabled)...


2025-02-20 18:12:04,583 [DEBUG] Replaced 15 outliers in column 'NOSE_x' with rolling median.
2025-02-20 18:12:04,592 [DEBUG] Replaced 6 outliers in column 'NOSE_y' with rolling median.
2025-02-20 18:12:04,603 [DEBUG] Replaced 35 outliers in column 'NOSE_z' with rolling median.
2025-02-20 18:12:04,613 [DEBUG] Replaced 41 outliers in column 'R_SHOULDER_x' with rolling median.
2025-02-20 18:12:04,623 [DEBUG] Replaced 9 outliers in column 'R_SHOULDER_y' with rolling median.
2025-02-20 18:12:04,632 [DEBUG] Replaced 52 outliers in column 'R_SHOULDER_z' with rolling median.
2025-02-20 18:12:04,642 [DEBUG] Replaced 47 outliers in column 'L_SHOULDER_x' with rolling median.
2025-02-20 18:12:04,651 [DEBUG] Replaced 29 outliers in column 'L_SHOULDER_y' with rolling median.
2025-02-20 18:12:04,660 [DEBUG] Replaced 52 outliers in column 'L_SHOULDER_z' with rolling median.
2025-02-20 18:12:04,666 [DEBUG] Columns after sorting: ['landing_x', 'landing_y', 'entry_angle', 'ball_x', 'ball_y', 'ball_z', 'R

[INFO] Shooting Motion Segmentation Example complete.
[INFO] X_seq (shooting motion segmentation) shape: (125, 192, 149)
[INFO] y_seq (shooting motion segmentation) shape: (125, 192, 1)
[INFO] Preprocessing recommendations (shooting motion segmentation):
                                  Preprocessing Reason
shooting_motion  Categorical: Most_frequent Imputation
trial_id         Categorical: Most_frequent Imputation
landing_x                 Numerical: Median Imputation
landing_y                 Numerical: Median Imputation
entry_angle               Numerical: Median Imputation
ball_x                    Numerical: Median Imputation
ball_y                    Numerical: Median Imputation
ball_z                    Numerical: Median Imputation
R_EYE_x                   Numerical: Median Imputation
R_EYE_y                   Numerical: Median Imputation
R_EYE_z                   Numerical: Median Imputation
L_EYE_x                   Numerical: Median Imputation
L_EYE_y                   Nume
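
The `X_seq` shape logged above, `(125, 192, 149)`, reads as (sequences, time steps, features). When DTW is disabled, variable-length sequences have to reach a common length some other way. The following is a minimal sketch of zero-padding with a validity mask, assuming NumPy arrays of shape `(length, features)`. The function name `pad_sequences` and its parameters are hypothetical and are not taken from `DataPreprocessor`.

```python
import numpy as np

def pad_sequences(sequences, max_len=None, pad_value=0.0):
    """Pad (or truncate) each sequence to `max_len` and return a boolean mask.

    Hypothetical helper for illustration; not the project's implementation.
    """
    max_len = max_len or max(len(s) for s in sequences)
    n_features = sequences[0].shape[1]
    batch = np.full((len(sequences), max_len, n_features), pad_value, dtype=float)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for k, seq in enumerate(sequences):
        length = min(len(seq), max_len)  # sequences longer than max_len are truncated
        batch[k, :length] = seq[:length]
        mask[k, :length] = True          # True marks real (non-padded) time steps
    return batch, mask

# Three toy sequences of lengths 3, 5 and 4 with 2 features each (invented values).
seqs = [np.ones((3, 2)), np.ones((5, 2)), np.ones((4, 2))]
X, mask = pad_sequences(seqs)
print(X.shape)           # (3, 5, 2)
print(mask.sum(axis=1))  # [3 5 4]
```

Keeping the mask alongside the padded batch lets a downstream model ignore padded steps instead of treating the pad value as signal.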

# Load and explore the dataset
import pandas as pd
import numpy as np

# Load the parquet file
df = pd.read_parquet("../../dataset/test/data/final_inner_join_emg_biomech_data.parquet")

# Function to analyze column types and unique values
def analyze_columns(df):
    analysis = {}
    for col in df.columns:
        dtype = df[col].dtype
        n_unique = df[col].nunique()
        unique_values = df[col].unique()
        
        # Sample of unique values (at most 10 for display; slicing is a no-op on shorter arrays)
        sample_values = unique_values[:10]
        
        analysis[col] = {
            'dtype': str(dtype),
            'n_unique': n_unique,
            'sample_values': sample_values
        }
    return analysis

# Perform analysis
column_analysis = analyze_columns(df)

# Print results in organized sections
print("Dataset Analysis:")
print(f"Total number of rows: {len(df)}")
print(f"Total number of columns: {len(df.columns)}")
print("\nColumn Analysis:")

for col, info in column_analysis.items():
    print(f"\n{col}:")
    print(f"  Data type: {info['dtype']}")
    print(f"  Number of unique values: {info['n_unique']}")
    print(f"  Sample values: {info['sample_values']}")
    
# Suggest column categorization
suggested_categorization = {
    'nominal_categorical': [],
    'ordinal_categorical': [],
    'numerical': []
}

for col, info in column_analysis.items():
    if info['dtype'] in ['object', 'string', 'category']:
        if info['n_unique'] < 20:  # Arbitrary threshold for categorical variables
            suggested_categorization['nominal_categorical'].append(col)
    elif 'int' in info['dtype'] or 'float' in info['dtype']:
        if info['n_unique'] < 10:  # Small number of numeric values might indicate ordinal
            suggested_categorization['ordinal_categorical'].append(col)
        else:
            suggested_categorization['numerical'].append(col)

# print("\nSuggested Variable Categorization:")
# print("\nNominal Categorical Variables:")
# print(suggested_categorization['nominal_categorical'])
# print("\nOrdinal Categorical Variables:")
# print(suggested_categorization['ordinal_categorical'])
# print("\nNumerical Variables:")
# print(suggested_categorization['numerical'])




def main_time_series():
    # Load configuration from the updated YAML file
    config_path = Path("../../dataset/test/preprocessor_config/preprocessor_config_baseball.yaml")
    config = load_config(config_path)
    
    # Extract time series parameters from the config
    ts_params = config.get("time_series", {})
    if not ts_params.get("enabled", False):
        print("[INFO] Time series processing is not enabled in the config.")
        return

    # Set dataset path based on config
    data_dir = Path(config["paths"]["data_dir"])
    raw_data_file = config["paths"]["raw_data"]
    raw_data_path = data_dir / raw_data_file
    
    # Load dataset
    try:
        df = pd.read_csv(raw_data_path)
        print(f"[INFO] Dataset loaded from {raw_data_path}. Shape: {df.shape}")
    except Exception as e:
        print(f"[ERROR] Failed to load dataset: {e}")
        return

    # Estimate an optimal window size from the target column (optional)
    target_col = config["features"]["y_variable"][0]
    optimal_window = estimate_optimal_window_size(df[target_col], threshold=0.5, max_window=100)
    print("[INFO] Estimated optimal window size:", optimal_window)
    
    # -------------------------------
    # Example 1: Trial Segmentation (Per-Trial) using Padding (DTW disabled)
    # -------------------------------
    ts_params_trial = ts_params.copy()
    ts_params_trial["use_dtw"] = False  # Disable DTW for simple padding
    print("\n[INFO] Running Trial Segmentation Example (per-trial segmentation with DTW disabled)...")
    
    # preprocessor_trial = DataPreprocessor(
    #     model_type="LSTM",  # triggers the time_series branch
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"].get("ordinal_categoricals", []),
    #     nominal_categoricals=config["features"].get("nominal_categoricals", []),
    #     numericals=[col for col in config["features"]["numericals"] if col != target_col],  # exclude target and time column
    #     mode="train",
    #     options=ts_params_trial,
    #     debug=True,
    #     graphs_output_dir=str(Path(config["paths"]["plots_output_dir"])),
    #     transformers_dir=str(Path(config["paths"]["transformers_save_base_dir"])),
    #     time_column=ts_params_trial.get("time_column", "frame_time"),
    #     window_size=ts_params_trial.get("window_size", optimal_window),
    #     horizon=ts_params_trial.get("horizon", 1),
    #     step_size=ts_params_trial.get("step_size", 1),
    #     max_sequence_length=ts_params_trial.get("max_sequence_length", optimal_window),
    #     use_dtw=ts_params_trial["use_dtw"],
    #     sequence_categorical=["trial_id"]  # Grouping by trial identifier
    # )
    
    # try:
    #     X_seq_trial, _, y_seq_trial, _, rec_trial, _ = preprocessor_trial.final_preprocessing(df)
    #     print("[INFO] Trial Segmentation Example complete.")
    #     print(f"[INFO] X_seq (trial segmentation) shape: {X_seq_trial.shape}")
    #     print(f"[INFO] y_seq (trial segmentation) shape: {y_seq_trial.shape}")
    #     print("[INFO] Preprocessing recommendations (trial segmentation):")
    #     print(rec_trial)
    # except Exception as e:
    #     print(f"[ERROR] Trial Segmentation Example failed: {e}")
    
    # -------------------------------
    # Example 2: Whole-Session Segmentation using Sliding Window (DTW enabled)
    # -------------------------------
    ts_params_session = ts_params.copy()
    ts_params_session["use_dtw"] = True  # Enable DTW alignment
    print("\n[INFO] Running Whole-Session Segmentation Example (sliding window with DTW enabled)...")
    
    # preprocessor_session = DataPreprocessor(
    #     model_type="LSTM",
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"].get("ordinal_categoricals", []),
    #     nominal_categoricals=config["features"].get("nominal_categoricals", []),
    #     numericals=[col for col in config["features"]["numericals"] if col != target_col],
    #     mode="train",
    #     options=ts_params_session,
    #     debug=True,
    #     graphs_output_dir=str(Path(config["paths"]["plots_output_dir"])),
    #     transformers_dir=str(Path(config["paths"]["transformers_save_base_dir"])),
    #     time_column=ts_params_session.get("time_column", "frame_time"),
    #     window_size=ts_params_session.get("window_size", optimal_window),
    #     horizon=ts_params_session.get("horizon", 1),
    #     step_size=ts_params_session.get("step_size", 1),
    #     max_sequence_length=ts_params_session.get("max_sequence_length", optimal_window),
    #     use_dtw=ts_params_session["use_dtw"],
    #     sequence_categorical=None  # Use sliding window segmentation
    # )
    
    # try:
    #     X_seq_session, _, y_seq_session, _, rec_session, _ = preprocessor_session.final_preprocessing(df)
    #     print("[INFO] Whole-Session Segmentation Example complete.")
    #     print(f"[INFO] X_seq (session segmentation) shape: {X_seq_session.shape}")
    #     print(f"[INFO] y_seq (session segmentation) shape: {y_seq_session.shape}")
    #     print("[INFO] Preprocessing recommendations (session segmentation):")
    #     print(rec_session)
    # except Exception as e:
    #     print(f"[ERROR] Whole-Session Segmentation Example failed: {e}")
    
    # -------------------------------
    # Example 3: Shooting Motion Segmentation with DTW Enabled
    # -------------------------------
    # In this example, the columns passed as sequence_categorical below act as the grouping variables.
    # This tests the pipeline's ability to group sequences by those columns
    # while DTW alignment is enabled.
    ts_params_shooting = ts_params.copy()
    ts_params_shooting["use_dtw"] = True  # Enable DTW alignment
    print("\n[INFO] Running Shooting Motion Segmentation Example (grouping by shooting_motion with DTW enabled)...")
    
    preprocessor_shooting = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"].get("ordinal_categoricals", []),
        nominal_categoricals=config["features"].get("nominal_categoricals", []),
        numericals=[col for col in config["features"]["numericals"] if col != target_col],  # exclude the target column
        mode="train",
        options=ts_params_shooting,
        debug=True,
        graphs_output_dir=str(Path(config["paths"]["plots_output_dir"])),
        transformers_dir=str(Path(config["paths"]["transformers_save_base_dir"])),
        time_column=ts_params_shooting.get("time_column", "frame_time"),
        window_size=ts_params_shooting.get("window_size", optimal_window),
        horizon=ts_params_shooting.get("horizon", 1),
        step_size=ts_params_shooting.get("step_size", 1),
        max_sequence_length=ts_params_shooting.get("max_sequence_length", optimal_window),
        use_dtw=ts_params_shooting["use_dtw"],
        sequence_categorical=["trial_biomech", "pitch_phase_biomech"]  # Group sequences by trial and pitch phase
    )
    
    try:
        X_seq_shooting, _, y_seq_shooting, _, rec_shooting, _ = preprocessor_shooting.final_preprocessing(df)
        print("[INFO] Shooting Motion Segmentation Example complete.")
        print(f"[INFO] X_seq (shooting motion segmentation) shape: {X_seq_shooting.shape}")
        print(f"[INFO] y_seq (shooting motion segmentation) shape: {y_seq_shooting.shape}")
        print("[INFO] Preprocessing recommendations (shooting motion segmentation):")
        print(rec_shooting)
    except Exception as e:
        print(f"[ERROR] Shooting Motion Segmentation Example failed: {e}")

if __name__ == "__main__":
    print("Running updated time series preprocessing main function...")
    main_time_series()
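
Example 3 combines two ideas: grouping rows by `trial_biomech` and `pitch_phase_biomech`, and aligning the resulting variable-length sequences with DTW. The internals of `DataPreprocessor` are not shown here, so the following is only a minimal sketch of that idea under stated assumptions: the helpers `dtw_path` and `warp_to_reference`, the `signal` column, and the toy values are hypothetical, and the longest group is used as the reference.

```python
import numpy as np
import pandas as pd

def dtw_path(a, b):
    """Return the optimal DTW alignment path between sequences a and b as (i, j) pairs."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    # Fill the accumulated-cost matrix with the standard three-way recurrence.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack from the end to the start, always taking the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1, j - 1], i - 1, j - 1),
            (cost[i - 1, j], i - 1, j),
            (cost[i, j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    return path[::-1]

def warp_to_reference(seq, ref):
    """Warp seq onto the time axis of ref by averaging the frames aligned to each ref step."""
    seq = np.asarray(seq, dtype=float).reshape(len(seq), -1)
    out = np.zeros((len(ref), seq.shape[1]))
    counts = np.zeros(len(ref))
    for i, j in dtw_path(seq, ref):
        out[j] += seq[i]
        counts[j] += 1
    return out / counts[:, None]

# Toy data: two trials of the same phase with different lengths (illustrative values only).
toy = pd.DataFrame({
    "trial_biomech": [1, 1, 1, 2, 2, 2, 2, 2],
    "pitch_phase_biomech": ["arm_cocking"] * 8,
    "signal": [0.0, 1.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0],
})

# One sequence per (trial, phase) group.
groups = [
    g[["signal"]].to_numpy()
    for _, g in toy.groupby(["trial_biomech", "pitch_phase_biomech"], sort=False)
]
reference = max(groups, key=len)  # assumption: align everything to the longest group
X_seq = np.stack([warp_to_reference(g, reference) for g in groups])
print(X_seq.shape)
```

Every warped sequence ends up with the reference length, so the groups can be stacked into a single `(num_groups, reference_length, num_features)` array without zero padding. The nested loops are quadratic in sequence length, which is fine for a sketch but slow on long recordings.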

Dataset Analysis:
Total number of rows: 134720
Total number of columns: 100

Column Analysis:

EMG 1 (mV) - FDS (81770):
  Data type: float64
  Number of unique values: 4767
  Sample values: [-0.0391089 -0.0345769 -0.0142672 -0.0157778 -0.0448157 -0.0827497
 -0.0948348 -0.0384375  0.0186313  0.0642863]

ACC X (G) - FDS (81770):
  Data type: float64
  Number of unique values: 27508
  Sample values: [-0.9207153 -0.9168091 -0.8994141 -0.8842773 -0.8800659 -0.8961792
 -0.8947754 -0.8815308 -0.8793335 -0.8860474]

ACC Y (G) - FDS (81770):
  Data type: float64
  Number of unique values: 36905
  Sample values: [0.406311  0.4068604 0.4025269 0.4116821 0.4137573 0.4003296 0.3912354
 0.37854   0.3707275 0.3747559]

ACC Z (G) - FDS (81770):
  Data type: float64
  Number of unique values: 26191
  Sample values: [-0.3459473 -0.3414917 -0.3353271 -0.333252  -0.3372192 -0.3606567
 -0.3652344 -0.3630981 -0.3761597 -0.3837891]

GYRO X (deg/s) - FDS (81770):
  Data type: float64
  Number of unique value

KeyError: 'elbow_energy_generation_biomech'
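
The traceback is not shown, so where this `KeyError` is raised is unknown. One plausible cause is that a column named in the config is absent from the DataFrame. A minimal sketch of a pre-flight check under that assumption follows; `find_missing_columns` is a hypothetical helper, the feature keys mirror the `config["features"]` entries used above, and the toy DataFrame stands in for the real data.

```python
import pandas as pd

def find_missing_columns(df, features):
    """Return configured column names that are absent from df, in config order."""
    requested = []
    for key in ("y_variable", "ordinal_categoricals", "nominal_categoricals", "numericals"):
        value = features.get(key, [])
        # Accept either a single column name or a list of names.
        requested.extend([value] if isinstance(value, str) else value)
    # dict.fromkeys removes duplicates while preserving order.
    return [col for col in dict.fromkeys(requested) if col not in df.columns]

# Toy stand-ins for the real dataset and config (illustrative values only).
toy_df = pd.DataFrame({
    "trial_biomech": [1, 1],
    "pitch_phase_biomech": ["arm_cocking", "arm_acceleration"],
    "frame_time": [0.0, 0.01],
})
toy_features = {
    "y_variable": ["elbow_energy_generation_biomech"],
    "nominal_categoricals": ["pitch_phase_biomech"],
    "numericals": ["frame_time"],
}

missing = find_missing_columns(toy_df, toy_features)
print(missing)
```

Running a check like this on the real `df` and `config["features"]` before constructing `DataPreprocessor` would show whether the name is missing or misspelled in the data, or whether the error comes from somewhere else in the pipeline.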