# Forecasting Pitcher Injury Risk Throughout Sessions Using Time Series Methods

The question of how to forecast pitcher injury risk throughout and across pitching sessions represents an advanced application of time series forecasting. By leveraging the sequence processing capabilities in the provided DataPreprocessor class, we can develop models that not only assess current injury risk but predict how risk profiles might evolve through continued pitching. This report details how to implement such a system using the various preprocessing modes available.

## Understanding Pitcher Biomechanics as Time Series Data

Pitcher motion naturally segments into distinct biomechanical phases, which makes it an ideal candidate for time series analysis. Looking at the code, we can see explicit references to pitching phases in the DataPreprocessor class:

```python
predefined_order = ["windup", "arm_cocking", "arm_acceleration", "follow_through"]
```

This structure allows us to model each pitch as a multivariate time series sequence flowing through these mechanical phases. For injury forecasting, we need to consider both:

1. Intra-pitch dynamics - biomechanical patterns within a single pitch
2. Inter-pitch evolution - how patterns change across consecutive pitches

### Data Structure for Pitcher Monitoring

For effective injury risk modeling, we would organize pitching data hierarchically:

- Each pitcher has multiple sessions
- Each session contains multiple pitches
- Each pitch consists of sensor measurements across multiple phases
- Each phase contains multiple timesteps with biomechanical measurements

The DataPreprocessor supports this exact structure through its primary and secondary grouping parameters:

```python
sequence_categorical=["pitcher_id", "session_id"]  # Group by pitcher and session
sequence_dtw_or_pad_categorical=["phase"]  # Sub-divide by pitching phase
```

## Selecting the Optimal Preprocessing Mode

The DataPreprocessor offers three sequence handling modes, each with different advantages for injury forecasting:

### Dynamic Time Warping (DTW) Mode

The DTW mode is particularly well-suited for pitcher injury forecasting for several reasons:

```python
self.time_series_sequence_mode = "dtw"
```

DTW aligns sequences by warping the time dimension, which addresses a critical challenge in biomechanical analysis - pitchers naturally vary their timing between pitches, especially as fatigue sets in. DTW correctly aligns the biomechanical events rather than the clock time, ensuring that the arm_cocking phase from pitch #1 is properly compared to the arm_cocking phase of pitch #50, despite potential differences in duration.

From the code implementation:

```python
if self.time_series_sequence_mode == "dtw":
    alignment_path = dtw_path(phase_array, alignment_path, target_length)
    aligned_seq = warp_sequence(phase_array, alignment_path, target_length)
```

This alignment preserves the crucial biomechanical relationships while standardizing sequence lengths for model input.

### Padding Mode

Padding offers a simpler alternative when the timing variations between pitches are minimal:

```python
# In pad mode
aligned_seq = self.pad_sequence(phase_array, target_length)
```

This approach is computationally less expensive than DTW while still enabling comparison across pitches of different lengths. It works well when the pitching motion is highly consistent in timing, but less effectively when a pitcher significantly alters their timing due to fatigue.

### Sliding Window Mode

For monitoring evolving patterns across consecutive pitches, the set_window mode becomes valuable:

```python
# From create_sequences method
for i in range(0, len(X) - self.window_size - self.horizon + 1, self.step_size):
    seq_X = X[i:i+self.window_size]
    seq_y = y[i+self.window_size:i+self.window_size+self.horizon]
```

This approach allows us to capture how biomechanical patterns drift over sequences of pitches - a critical factor in injury development, as most pitching injuries result from cumulative strain rather than acute events.

## Forecasting Complete Pitching Sessions

Now addressing your specific question about forecasting full pitching sessions based on previous sequences - yes, this is achievable through recursive prediction approaches.

### Implementation Strategy for Full Session Forecasting

To forecast a complete pitching session, we would:

1. Train two specialized models:
   - A mechanics evolution model that predicts how biomechanics change from pitch to pitch
   - An injury risk model that translates mechanics into injury probability

2. Apply these models recursively to generate a session-long forecast:

```python
def forecast_session(initial_pitches, num_future_pitches=50):
    # Process initial pitches using DataPreprocessor
    processed_data = preprocessor.process_dtw_or_pad(initial_pitches)
    
    # Initialize with observed pitches
    current_sequence = processed_data.copy()
    forecasted_mechanics = [current_sequence]
    injury_risks = []
    
    # Generate predictions for future pitches
    for i in range(num_future_pitches):
        # Predict mechanics for next pitch
        next_mechanics = mechanics_model.predict(current_sequence)
        
        # Update sequence for next prediction
        current_sequence = update_sequence(current_sequence, next_mechanics)
        
        # Predict injury risk for this new mechanical pattern
        risk = injury_model.predict(current_sequence)
        
        # Store predictions
        forecasted_mechanics.append(current_sequence.copy())
        injury_risks.append(risk)
    
    return forecasted_mechanics, injury_risks
```

This approach enables forecasting how a pitcher's mechanics and injury risk will evolve throughout a complete session.

## Forecasting Across Multiple Sessions

For forecasting across multiple pitching sessions, we extend the model to incorporate recovery patterns:

### Between-Session Recovery Modeling

The horizon parameter in the DataPreprocessor becomes particularly relevant:

```python
self.horizon = time_series_config.get('horizon')
```

By setting the horizon to represent future sessions rather than future pitches, we can predict how a pitcher's baseline mechanics might change from session to session based on workload, recovery time, and previous mechanical patterns[1].

For multi-session forecasting:

1. Use session-aggregated features (e.g., average mechanics, fatigue progression rate)
2. Incorporate recovery metrics (days rest, treatment interventions)
3. Apply the horizon parameter to predict multiple sessions ahead:

```python
# Configure for multi-session forecasting
options = {
    "horizon": 3,  # Predict 3 sessions ahead
    "sequence_modes": {
        "dtw": {
            "use_dtw": True,
            "reference_sequence": "max"
        }
    }
}
```

This configuration allows the model to predict how a pitcher's performance and injury risk will evolve over their next three pitching sessions.

## Practical Implementation Considerations

### Feature Engineering for Injury Risk

Effective injury forecasting requires carefully selected features:

1. **Biomechanical features**: Joint angles, velocities, and accelerations
2. **Workload metrics**: Pitch count, intensity, and cumulative load
3. **Recovery indicators**: Rest days, subjective fatigue ratings
4. **Historical injury patterns**: Previous injury locations and rehabilitation status

### Training Data Requirements

The DataPreprocessor's handling of target variables is particularly relevant:

```python
if self.horizon == 1:
    y_seq.append(y_group[-1])
else:
    y_seq.append(y_group[-self.horizon:].squeeze())
```

For injury forecasting, the target variable could be:
- Binary classification: Injury occurrence (yes/no)
- Regression: Injury risk score (0-100%)
- Multi-class: Specific injury type prediction

Each approach requires different amounts of labeled training data.

## Practical Challenges and Limitations

Several challenges must be addressed in implementing this forecasting approach:

### Error Propagation in Recursive Forecasting

The recursive nature of full session forecasting means that errors in early predictions propagate to later predictions. The code doesn't explicitly handle this, but approaches like ensemble forecasting could help mitigate this issue by generating multiple potential trajectories[1].

### Individual Variation in Pitching Mechanics

The DataPreprocessor's grouping capabilities are crucial for handling individual variation:

```python
grouped = self._group_top_level(data)
```

This allows for pitcher-specific models that account for unique biomechanical signatures, but requires sufficient data for each individual pitcher.

### Real-time Deployment Considerations

For in-game application, latency becomes critical. The preprocessing pipeline must be optimized for speed:

```python
# Streamlined prediction pathway
def streamlined_prediction(new_pitch_data):
    processed = preprocessor.preprocess_predict_time_series(new_pitch_data)
    return injury_model.predict(processed)
```

This enables real-time risk assessment as pitches occur, allowing for immediate intervention when risk thresholds are exceeded.

## Conclusion

The time series preprocessing methods in the DataPreprocessor class provide a powerful foundation for modeling and forecasting pitcher injury risk. By combining DTW for phase alignment with sliding windows for capturing mechanical drift patterns, we can build comprehensive models that forecast injury risk throughout single sessions and across multiple sessions.

The ability to forecast full sequences enables proactive injury prevention by identifying risky mechanical patterns before they lead to injury. While challenges remain in terms of data requirements and error propagation, the framework outlined here represents a significant advance in pitcher health monitoring and could substantially reduce injury rates through early intervention.

Citations:
[1] https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/53658542/f03fbcc0-0ca4-4fa3-877f-10948aa60a6d/paste.txt

---
Answer from Perplexity: pplx.ai/share

# splitting time series with sequeencs in mind:

Can we make this easier on ourselves by just using the sequence_categorical and the time_column to help us split these in different conditions?  Start by getting the count of the dataset and count of each categorical metrics within that so we can use that for the split percentage so if we used sessions and we have 10 sessions within the dataset and we wanted to split it 80/20 we would then it would be 8 train sessions and 2 test sessions ensuring we still keep the sequence correctly based on the time_column and that if the seqeuences are perfectly splittable then we shoould get the closest before and after values for suggestion and go with the closest option, and if we wanted to split by a specific date based on the time column we should still get the count of the sequence_categorical and then get the beginning time_column from each so we can use the one after and before as suggestions for the split and use the closest one to split them. This way the sequences are done correctly. 




In [34]:
# datapreprocessor.py

import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple, Any
import logging
import os
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder, OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from imblearn.over_sampling import SMOTE, ADASYN, SMOTENC, SMOTEN, BorderlineSMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import probplot
import joblib  # For saving/loading transformers
from inspect import signature  # For parameter validation in SMOTE
from functools import wraps
import re
from feature_engine.selection import DropHighPSIFeatures

def dtw_path(s1: np.ndarray, s2: np.ndarray) -> list:
    """
    Compute the DTW cost matrix and return the optimal warping path.
    
    Args:
        s1: Sequence 1, shape (n, features)
        s2: Sequence 2, shape (m, features)
    
    Returns:
        path: A list of index pairs [(i, j), ...] indicating the alignment.
    """
    n, m = len(s1), len(s2)
    cost = np.full((n+1, m+1), np.inf)
    cost[0, 0] = 0

    # Build the cost matrix
    for i in range(1, n+1):
        for j in range(1, m+1):
            dist = np.linalg.norm(s1[i-1] - s2[j-1])
            cost[i, j] = dist + min(cost[i-1, j], cost[i, j-1], cost[i-1, j-1])

    # Backtracking to find the optimal path
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i-1, j-1))
        directions = [cost[i-1, j], cost[i, j-1], cost[i-1, j-1]]
        min_index = np.argmin(directions)
        if min_index == 0:
            i -= 1
        elif min_index == 1:
            j -= 1
        else:
            i -= 1
            j -= 1
    path.reverse()
    return path

def warp_sequence(seq: np.ndarray, path: list, target_length: int) -> np.ndarray:
    """
    Warp the given sequence to match the target length based on the DTW warping path.
    
    Args:
        seq: Original sequence, shape (n, features)
        path: Warping path from dtw_path (list of tuples)
        target_length: Desired sequence length (typically the reference length)
    
    Returns:
        aligned_seq: Warped sequence with shape (target_length, features)
    """
    aligned_seq = np.zeros((target_length, seq.shape[1]))
    # Create mapping: for each target index, collect corresponding indices from seq
    mapping = {t: [] for t in range(target_length)}
    for (i, j) in path:
        mapping[j].append(i)
    
    for t in range(target_length):
        indices = mapping[t]
        if indices:
            aligned_seq[t] = np.mean(seq[indices], axis=0)
        else:
            # If no alignment, reuse the previous value (or use interpolation)
            aligned_seq[t] = aligned_seq[t-1] if t > 0 else seq[0]
    return aligned_seq
    


class DataPreprocessor:
    def __init__(
        self,
        model_type: str,
        y_variable: List[str],
        ordinal_categoricals: List[str],
        nominal_categoricals: List[str],
        numericals: List[str],
        mode: str,  # 'train', 'predict', 'clustering'
        options: Optional[Dict] = None,
        debug: bool = False,
        normalize_debug: bool = False,
        normalize_graphs_output: bool = False,
        graphs_output_dir: str = './plots',
        transformers_dir: str = './transformers',
        time_series_sequence_mode: str = "pad",  # "set_window", "pad", "dtw", "variable_length"
        sequence_categorical: Optional[List[str]] = None,
        sequence_dtw_or_pad_categorical: Optional[List[str]] = None
    ):
        # --- Process and validate sequence grouping parameters ---
        self.sequence_categorical = list(sequence_categorical) if sequence_categorical else []
        self.sequence_dtw_or_pad_categorical = list(sequence_dtw_or_pad_categorical) if sequence_dtw_or_pad_categorical else []
        
        if set(self.sequence_categorical) & set(self.sequence_dtw_or_pad_categorical):
            conflicting = set(self.sequence_categorical) & set(self.sequence_dtw_or_pad_categorical)
            raise ValueError(f"Categorical conflict in {conflicting}. Top-level and sub-phase groups must form a strict hierarchy")

        # --- Basic attribute assignments ---
        self.model_type = model_type
        self.y_variable = y_variable
        # Defensive coding: if feature lists are None, set them to empty lists
        self.ordinal_categoricals = ordinal_categoricals if ordinal_categoricals is not None else []
        self.nominal_categoricals = nominal_categoricals if nominal_categoricals is not None else []
        self.numericals = numericals if numericals is not None else []
        self.mode = mode.lower()
        if self.mode not in ['train', 'predict', 'clustering']:
            raise ValueError("Mode must be one of 'train', 'predict', or 'clustering'.")
        self.options = options or {}
        # Default outlier method for time series (fallback if not provided in time_series config)
        self.ts_outlier_method = self.options.get('handle_outliers', {}).get('time_series_method', 'median').lower()

        self.debug = debug
        self.normalize_debug = normalize_debug
        self.normalize_graphs_output = normalize_graphs_output
        self.graphs_output_dir = graphs_output_dir
        self.transformers_dir = transformers_dir

        # --- Initialize logging ---
        self.logger = logging.getLogger(self.__class__.__name__)
        self.logger.setLevel(logging.DEBUG if self.debug else logging.INFO)
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s [%(levelname)s] %(message)s')
        handler.setFormatter(formatter)
        if not self.logger.handlers:
            self.logger.addHandler(handler)

        # --- Log the initialized feature lists for debugging ---
        self.logger.debug(f"Initialized ordinal_categoricals: {self.ordinal_categoricals}")
        self.logger.debug(f"Initialized nominal_categoricals: {self.nominal_categoricals}")
        self.logger.debug(f"Initialized numericals: {self.numericals}")

        # --- NEW: Extract time series parameters from config ---
        time_series_config = self.options  # Use options directly
        self.time_series_enabled = time_series_config.get('enabled', False)
        self.time_column = time_series_config.get('time_column')
        self.horizon = time_series_config.get('horizon')
        self.step_size = time_series_config.get('step_size')
        self.ts_outlier_method = time_series_config.get('ts_outlier_handling_method', self.ts_outlier_method).lower()
        self.time_series_sequence_mode = time_series_sequence_mode.lower()

        # --- Store expected model shape (will be populated during load_transformers) ---
        self.expected_model_shape = None
        self.actual_output_shape = None

        sequence_modes_config = time_series_config.get('sequence_modes', {})
        if self.time_series_sequence_mode == "set_window":
            set_window_config = sequence_modes_config.get('set_window', {})
            self.window_size = set_window_config.get('window_size')
            self.max_sequence_length = set_window_config.get('max_sequence_length')
            
            # UPDATED: Only validate for train mode, in predict mode we'll load from transformers
            if self.mode == 'train':
                if self.window_size is None or self.step_size is None:
                    raise ValueError("Both 'window_size' (from set_window config) and 'step_size' must be provided for 'set_window' mode in training.")
            else:
                if self.window_size is None or self.step_size is None:
                    self.logger.warning("In predict mode, 'window_size' and/or 'step_size' are missing; these will be loaded from saved transformers if available.")
        elif self.time_series_sequence_mode == "pad":
            pad_config = sequence_modes_config.get('pad', {})
            self.padding_side = pad_config.get('padding_side', 'post')
            self.pad_threshold = pad_config.get('pad_threshold')
            self.max_sequence_length = None
        elif self.time_series_sequence_mode == "dtw":
            dtw_config = sequence_modes_config.get('dtw', {})
            self.use_dtw = dtw_config.get('use_dtw', True)
            self.reference_sequence = dtw_config.get('reference_sequence', 'max')
            self.dtw_threshold = dtw_config.get('dtw_threshold')
            self.max_sequence_length = None
        elif self.time_series_sequence_mode == "variable_length":
            var_config = sequence_modes_config.get('variable_length', {})
            self.window_size = var_config.get('window_size')
            self.max_sequence_length = None
        else:
            raise ValueError(f"Unknown time series sequence mode: {self.time_series_sequence_mode}")

        self.max_phase_distortion = self.options.get('max_phase_distortion', 0.3)
        self.max_length_variance = self.options.get('max_length_variance', 5)

        if self.sequence_categorical and self.sequence_dtw_or_pad_categorical:
            overlap = set(self.sequence_categorical) & set(self.sequence_dtw_or_pad_categorical)
            if overlap:
                raise ValueError(f"Overlapping grouping columns: {overlap}. Top-level and sub-phase groups must be distinct")

        # --- Determine model category ---
        self.hierarchical_categories = {}
        model_type_lower = self.model_type.lower()
        if any(kw in model_type_lower for kw in ['lstm', 'rnn', 'time series']):
            self.model_category = 'time_series'
        else:
            self.model_category = self.map_model_type_to_category()

        self.categorical_indices = []
        if self.model_category == 'unknown':
            self.logger.error(f"Model category for '{self.model_type}' is unknown. Check your configuration.")
            raise ValueError(f"Model category for '{self.model_type}' is unknown. Check your configuration.")
        if self.mode in ['train', 'predict']:
            if not self.y_variable:
                raise ValueError("Target variable 'y_variable' must be specified for supervised models in train/predict mode.")
        elif self.mode == 'clustering':
            self.y_variable = []

        self.time_step = self.options.get('time_step', 1/60) if self.options else 1/60

        # Initialize other variables (scalers, transformers, etc.)
        self.scaler = None
        self.transformer = None
        self.ordinal_encoder = None
        self.nominal_encoder = None
        self.preprocessor = None
        self.smote = None
        # Robust initialization of feature_reasons using a combined list of features
        all_features = []
        all_features.extend(self.ordinal_categoricals)
        all_features.extend(self.nominal_categoricals)
        all_features.extend(self.numericals)
        self.feature_reasons = {col: '' for col in all_features}
        self.preprocessing_steps = []
        self.normality_results = {}
        self.features_to_transform = []
        self.nominal_encoded_feature_names = []
        self.final_feature_order = []

        # Additional initialization for clustering
        self.cluster_transformers = {}
        self.cluster_model = None
        self.cluster_labels = None
        self.silhouette_score = None

        self.imbalance_threshold = self.options.get('smote_recommendation', {}).get('imbalance_threshold', 0.1)
        self.noise_threshold = self.options.get('smote_recommendation', {}).get('noise_threshold', 0.1)
        self.overlap_threshold = self.options.get('smote_recommendation', {}).get('overlap_threshold', 0.1)
        self.boundary_threshold = self.options.get('smote_recommendation', {}).get('boundary_threshold', 0.1)

        self.pipeline = None

        # Initialize logging again to ensure handler consistency
        self.logger = logging.getLogger(self.__class__.__name__)
        self.logger.setLevel(logging.DEBUG if self.debug else logging.INFO)
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s [%(levelname)s] %(message)s')
        handler.setFormatter(formatter)
        if not self.logger.handlers:
            self.logger.addHandler(handler)

        # Reinitialize feature_reasons (robustly) for clustering if needed
        all_features = []
        all_features.extend(self.ordinal_categoricals)
        all_features.extend(self.nominal_categoricals)
        all_features.extend(self.numericals)
        self.feature_reasons = {col: '' for col in all_features}
        if self.model_category == 'clustering':
            self.feature_reasons['all_numericals'] = ''



    def get_debug_flag(self, flag_name: str) -> bool:
        """
        Retrieve the value of a specific debug flag from the options.
        Args:
            flag_name (str): The name of the debug flag.
        Returns:
            bool: The value of the debug flag.
        """
        return self.options.get(flag_name, False)

    def _log(self, message: str, step: str, level: str = 'info'):
        """
        Internal method to log messages based on the step-specific debug flags.
        
        Args:
            message (str): The message to log.
            step (str): The preprocessing step name.
            level (str): The logging level ('info', 'debug', etc.).
        """
        debug_flag = self.get_debug_flag(f'debug_{step}')
        if debug_flag:
            if level == 'debug':
                self.logger.debug(message)
            elif level == 'info':
                self.logger.info(message)
            elif level == 'warning':
                self.logger.warning(message)
            elif level == 'error':
                self.logger.error(message)

    def map_model_type_to_category(self) -> str:
        """
        Map the model_type string to a predefined category based on keywords.

        Returns:
            str: The model category ('classification', 'regression', 'clustering', etc.).
        """
        classification_keywords = ['classifier', 'classification', 'logistic', 'svm', 'support vector machine', 'knn', 'neural network']
        regression_keywords = ['regressor', 'regression', 'linear', 'knn', 'neural network']  # Removed 'svm'
        clustering_keywords = ['k-means', 'clustering', 'dbscan', 'kmodes', 'kprototypes']

        model_type_lower = self.model_type.lower()

        for keyword in classification_keywords:
            if keyword in model_type_lower:
                return 'classification'

        for keyword in regression_keywords:
            if keyword in model_type_lower:
                return 'regression'

        for keyword in clustering_keywords:
            if keyword in model_type_lower:
                return 'clustering'

        return 'unknown'

    def filter_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        step_name = "filter_columns"
        self.logger.info(f"Step: {step_name}")

        # Combine all feature lists from configuration
        desired_features = self.numericals + self.ordinal_categoricals + self.nominal_categoricals

        # For time series models, ensure the time column is included
        if self.model_category == 'time_series' and self.time_column:
            if self.time_column not in df.columns:
                self.logger.error(f"Time column '{self.time_column}' not found in input data.")
                raise ValueError(f"Time column '{self.time_column}' not found in the input data.")
            if self.time_column not in desired_features:
                desired_features.append(self.time_column)

        # *** NEW: Auto-include sequence grouping columns ***
        # Ensure that any column listed in sequence_categorical is added to desired_features if present.
        for col in self.sequence_categorical:
            if col not in desired_features and col in df.columns:
                desired_features.append(col)
                self.logger.debug(f"Auto-added sequence column '{col}' to desired features")

        # Debug log: report target variable info
        self.logger.debug(f"y_variable provided: {self.y_variable}")
        if self.y_variable and all(col in df.columns for col in self.y_variable):
            self.logger.debug(f"First value in target column(s): {df[self.y_variable].iloc[0].to_dict()}")

        # For 'train' mode, ensure the target variable is present and excluded from features
        if self.mode == 'train':
            if not all(col in df.columns for col in self.y_variable):
                missing_y = [col for col in self.y_variable if col not in df.columns]
                self.logger.error(f"Target variable(s) {missing_y} not found in the input data.")
                raise ValueError(f"Target variable(s) {missing_y} not found in the input data.")
            desired_features = [col for col in desired_features if col not in self.y_variable]
            filtered_df = df[desired_features + self.y_variable].copy()
        else:
            filtered_df = df[desired_features].copy()

        # Check that all desired features are present in the input DataFrame
        missing_features = [col for col in desired_features if col not in df.columns]
        if missing_features:
            self.logger.error(f"The following required features are missing in the input data: {missing_features}")
            raise ValueError(f"The following required features are missing in the input data: {missing_features}")

        # Additional numeric type check for expected numeric columns
        for col in self.numericals:
            if col in filtered_df.columns and not np.issubdtype(filtered_df[col].dtype, np.number):
                raise TypeError(f"Numerical column '{col}' has non-numeric dtype {filtered_df[col].dtype}")

        self.logger.info(f"✅ Filtered DataFrame to include only specified features. Shape: {filtered_df.shape}")
        self.logger.debug(f"Selected Features: {desired_features}")
        if self.mode == 'train':
            self.logger.debug(f"Retained Target Variable(s): {self.y_variable}")

        return filtered_df




    def _group_top_level(self, data: pd.DataFrame):
        """
        Group the data based on top-level sequence categorical variables.
        Returns the grouped DataFrames (without converting them to NumPy arrays)
        to ensure that subsequent processing (such as sub-phase segmentation) has access
        to DataFrame methods like .groupby and .columns.
        """
        if not self.sequence_categorical:
            return [('default_group', data)]
        
        groups = data.groupby(self.sequence_categorical)
        self.logger.debug(f"Group keys: {list(groups.groups.keys())}")
        
        validated_groups = []
        for name, group in groups:
            try:
                self.logger.debug(f"Group '{name}' type: {type(group)}, Shape: {group.shape if hasattr(group, 'shape') else 'N/A'}")
            except Exception as e:
                self.logger.error(f"Error obtaining shape for group {name}: {e}")
            if isinstance(group, pd.DataFrame):
                # *** FIX: Return the DataFrame (not group.values) so that it retains the .columns attribute ***
                validated_groups.append((name, group))
            else:
                self.logger.warning(f"Unexpected group type {type(group)} for group {name}")
        return validated_groups



    # Add the mapping as a class-level constant
    # Add the enhanced phase name mapping as a class-level constant.
    PHASE_NAME_MAPPING = {
        # Wind-up variants (enhanced)
        "windup": "windup",
        "wind-up": "windup",
        "wind_up": "windup",
        "wind up": "windup",
        "windingup": "windup",
        "winding-up": "windup",
        "winding_up": "windup",
        "winding up": "windup",
        "wind": "windup",
        "setup": "windup",
        "preparation": "windup",
        "initial": "windup",
        "stance": "windup",
        "Wind-Up": "windup",
        "Wind Up": "windup",
        "Windup": "windup",
        "Wind_Up": "windup",
        
        # Arm cocking variants
        "arm cocking": "arm_cocking",
        "arm-cocking": "arm_cocking",
        "arm_cocking": "arm_cocking",
        "armcocking": "arm_cocking",
        
        # Arm acceleration variants
        "arm acceleration": "arm_acceleration",
        "arm-acceleration": "arm_acceleration",
        "arm_acceleration": "arm_acceleration",
        "armacceleration": "arm_acceleration",
        
        # Follow through variants
        "follow through": "follow_through",
        "follow-through": "follow_through",
        "follow_through": "follow_through",
        "followthrough": "follow_through",
        
        # Other potential phases
        "stride": "stride"
    }

    def validate_phase_coverage(self, data: pd.DataFrame) -> Dict:
        """
        Check phase coverage and consistency across the dataset.
        
        Logs the raw phase value counts and the normalized phase counts.
        It then verifies whether each standard phase is present and warns if any are missing.
        
        Args:
            data (pd.DataFrame): The input DataFrame containing phase information.
            
        Returns:
            Dict: A dictionary with normalized phase counts.
        """
        if 'pitch_phase_biomech' not in data.columns:
            self.logger.warning("Phase column 'pitch_phase_biomech' not found in data.")
            return {}
        
        phase_counts = data['pitch_phase_biomech'].value_counts()
        self.logger.info(f"Raw phase value counts:\n{phase_counts}")
        
        # Normalize and aggregate counts for standard phases.
        normalized_counts = {}
        for phase, count in phase_counts.items():
            norm_phase = self.normalize_phase_key(str(phase))
            normalized_counts[norm_phase] = normalized_counts.get(norm_phase, 0) + count
        
        self.logger.info(f"Normalized phase counts:\n{normalized_counts}")
        
        # Check for missing standard phases.
        standard_phases = ["windup", "arm_cocking", "arm_acceleration", "follow_through"]
        for phase in standard_phases:
            if phase not in normalized_counts:
                self.logger.warning(f"Standard phase '{phase}' not found in any form in the data.")
        
        return normalized_counts

    
    @staticmethod
    def normalize_phase_key(key: str) -> str:
        """Enhanced normalization with improved debug logging."""
        logger = logging.getLogger('PhaseNormalization')
        logger.debug(f"Normalizing: '{key}'")
        
        # Handle non-string input
        if not isinstance(key, str):
            result = str(key)
            logger.debug(f"Converted non-string to: '{result}'")
            return result
        
        # Convert to lowercase first to handle capitalization uniformly
        key_lower = key.lower().strip()
        logger.debug(f"Lowercase + strip: '{key_lower}'")
        
        # Check mapping first (case-insensitive match)
        if key_lower in DataPreprocessor.PHASE_NAME_MAPPING:
            result = DataPreprocessor.PHASE_NAME_MAPPING[key_lower]
            logger.debug(f"Direct mapping match: '{result}'")
            return result
        
        # Apply standard transformations
        key = key.strip()
        key = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', key)
        key = key.replace(" ", "_")
        key = re.sub(r'[-‐‑–—]', '', key)
        normalized = key.lower()
        
        logger.debug(f"Standard normalization: '{normalized}'")
        
        # Check mapping again after normalization
        if normalized in DataPreprocessor.PHASE_NAME_MAPPING:
            result = DataPreprocessor.PHASE_NAME_MAPPING[normalized]
            logger.debug(f"Mapping after normalization: '{result}'")
            return result
        
        logger.debug(f"Final (unmapped): '{normalized}'")
        return normalized



        
    @staticmethod
    def safe_array_conversion(data):
        """
        Convert input data to a NumPy array if it is not already.
        Handles both structured and unstructured arrays.
        Raises a TypeError if the input data is a dictionary.
        """
        if isinstance(data, dict):
            raise TypeError("Input data is a dict. Expected array-like input.")
        if isinstance(data, np.ndarray):
            if data.dtype.names:
                # For structured arrays, view as float32 and reshape to combine fields.
                return data.view(np.float32).reshape(data.shape + (-1,))
            return data
        elif hasattr(data, 'values'):
            arr = data.values
            if arr.dtype.names:
                return arr.view(np.float32).reshape(arr.shape + (-1,))
            return arr
        else:
            return np.array(data)
            
    def validate_group_phases(self, group_key, group_data):
        """
        Validate the phases for a given group by logging:
        - The raw phases found in the group's 'pitch_phase_biomech' column.
        - The normalized phases after applying normalization.
        - The required phases as per the phase order.
        
        This function aids in tracing how the phase values are processed.
        
        Args:
            group_key: Identifier for the group.
            group_data (pd.DataFrame): The group data containing phase information.
        """
        if 'pitch_phase_biomech' not in group_data.columns:
            self.logger.warning(f"Group {group_key} does not contain 'pitch_phase_biomech' column for phase validation.")
            return
        raw_phases = group_data['pitch_phase_biomech'].unique()
        self.logger.debug(f"Raw phases for group {group_key}: {raw_phases}")
        
        normalized_phases = [self.normalize_phase_key(p) for p in raw_phases if isinstance(p, str)]
        self.logger.debug(f"Normalized phases for group {group_key}: {normalized_phases}")
        
        required_phases = self.get_phase_order()
        self.logger.debug(f"Required phases: {required_phases}")


    def _segment_subphases(self, group_data: pd.DataFrame, skip_min_samples=False):
        """
        Segment a group's data into sub-phases based on the secondary grouping.
        For each phase, converts the data to a numeric NumPy array and maps it with a normalized phase key.
        
        Adaptive minimum sample requirement:
          - In prediction mode, set MIN_PHASE_SAMPLES = 2 (more lenient).
          - Otherwise, use 5 (or 1 if skip_min_samples is True).
        
        Args:
            group_data (pd.DataFrame): Data for one group.
            skip_min_samples (bool): If True, lower the sample threshold.
        
        Returns:
            dict: Mapping from normalized phase keys to tuples (original key, numeric array).
        """
        if not self.sequence_dtw_or_pad_categorical:
            if self.numericals:
                group_data = group_data[[col for col in group_data.columns if col in self.numericals]]
            return {"default_phase": ("default_phase", group_data.values)}
        
        if 'pitch_phase_biomech' in group_data.columns:
            unique_phases = group_data['pitch_phase_biomech'].unique()
            self.logger.debug(f"Raw phases in group: {unique_phases}")
            for phase in unique_phases:
                norm_value = self.normalize_phase_key(str(phase))
                self.logger.debug(f"Normalization test: '{phase}' → '{norm_value}'")
        
        phase_groups = list(group_data.groupby(self.sequence_dtw_or_pad_categorical))
        
        subphases = {}
        # Adaptive minimum sample requirement:
        if self.mode == 'predict':
            MIN_PHASE_SAMPLES = 2  # More lenient for prediction
        else:
            MIN_PHASE_SAMPLES = 2 if not skip_min_samples else 1
        
        if self.time_column and self.time_column in group_data.columns:
            try:
                self._validate_timestamps(group_data)
            except Exception as e:
                self.logger.warning(f"Timestamp validation error: {e}")
        
        for phase_key, phase_df in phase_groups:
            if isinstance(phase_key, tuple):
                stable_key = "|".join(map(str, phase_key))
            else:
                stable_key = str(phase_key)
            normalized_key = DataPreprocessor.normalize_phase_key(stable_key)
            self.logger.debug(f"Sub-phase raw key '{stable_key}' normalized to: '{normalized_key}'")
                            
            if not isinstance(phase_df, (pd.DataFrame, np.ndarray)):
                self.logger.error(f"Invalid type {type(phase_df)} for phase '{stable_key}'. Skipping.")
                continue

            phase_length = len(phase_df)
            self.logger.debug(f"Phase '{stable_key}' (normalized: '{normalized_key}') length: {phase_length}")

            if phase_length < MIN_PHASE_SAMPLES:
                self.logger.warning(f"Skipping short phase '{stable_key}' (length {phase_length} < {MIN_PHASE_SAMPLES})")
                continue

            if isinstance(phase_df, pd.DataFrame):
                numeric_phase_df = phase_df[[col for col in phase_df.columns if col in self.numericals]] if self.numericals else phase_df
                try:
                    numeric_phase_array = self.safe_array_conversion(numeric_phase_df)
                except Exception as e:
                    self.logger.error(f"Array conversion failed for phase '{stable_key}': {e}")
                    continue
            elif isinstance(phase_df, np.ndarray):
                numeric_phase_array = phase_df
            else:
                self.logger.error(f"Unexpected type {type(phase_df)} for phase '{stable_key}'. Skipping.")
                continue

            if numeric_phase_array.ndim == 1:
                numeric_phase_array = numeric_phase_array.reshape(-1, 1)
                self.logger.debug(f"Phase '{stable_key}' reshaped to 2D: {numeric_phase_array.shape}")

            subphases[normalized_key] = (stable_key, numeric_phase_array)
        
        self.logger.debug(f"Normalized phase keys obtained: {list(subphases.keys())}")
        expected_set = set(self.get_phase_order())
        self.logger.debug(f"Expected phase keys: {expected_set}")
        
        if not subphases:
            self.logger.error("No valid subphases detected in this group.")
            raise ValueError("Subphase segmentation produced an empty dictionary.")
        return subphases




    def _validate_timestamps(self, phase_data: pd.DataFrame):
        """
        Validate that timestamps in phase_data have no large discontinuities (>1 second gap).
        Logs a warning if a gap is detected.
        """
        time_col = self.time_column
        if time_col not in phase_data.columns:
            return
        diffs = phase_data[time_col].diff().dropna()
        diffs_seconds = diffs.dt.total_seconds()
        if (diffs_seconds  > 1.0).any():
            gap_loc = diffs_seconds .idxmax()
            self.logger.warning(
                f"Timestamp jump in group {getattr(phase_data, 'name', 'unknown')}: {diffs_seconds [gap_loc]:.2f}s gap at index {gap_loc}"
            )



    @staticmethod
    def pad_sequence(seq: np.ndarray, target_length: int) -> np.ndarray:
        """
        Pad or truncate the given sequence to match the target length.
        Ensures that the input is a 2D array. For a 1D input, reshapes it to (-1, 1).
        A minimum target_length of 5 is enforced to avoid degenerate sequences.
        """
        seq = np.array(seq)
        if seq.ndim == 1:
            seq = seq.reshape(-1, 1)  # Ensure the array is 2D
        current_length = seq.shape[0]
        target_length = max(target_length, 5)  # Enforce a minimum target length of 5
        if current_length >= target_length:
            return seq[:target_length]
        else:
            pad_width = target_length - current_length
            padding = np.zeros((pad_width, seq.shape[1]))
            return np.concatenate([seq, padding], axis=0)


    def _align_phase(self, phase_data, target_length: int, phase_name: str) -> np.ndarray:
        """
        Align a sub-phase's sequence to a target length using DTW (if enabled) or padding.
        Validates that the resulting array is 2D and has exactly target_length rows.
        If DTW alignment results in an array of incorrect shape, falls back to padding.
        """
        if isinstance(phase_data, pd.DataFrame):
            phase_data = phase_data[[col for col in phase_data.columns if col in self.numericals]] if self.numericals else phase_data.copy()

        if isinstance(phase_data, dict):
            self.logger.error(f"Received dict instead of array. Keys: {list(phase_data.keys())}")
            raise TypeError("Phase data must be array-like, not dict. Check your grouping logic.")

        phase_array = self.safe_array_conversion(phase_data)
        self.logger.debug(f"Aligning phase '{phase_name}' with input type {type(phase_data)} and shape {phase_array.shape}")

        if phase_array.ndim == 1:
            phase_array = phase_array.reshape(1, -1)
            self.logger.debug(f"Phase '{phase_name}' was 1D and has been reshaped to {phase_array.shape}")

        current_length = phase_array.shape[0]

        if phase_array.ndim != 2:
            self.logger.error(f"Invalid input shape {phase_array.shape} - expected a 2D array")
            raise ValueError("DTW alignment requires a 2D array input")
        
        if not np.issubdtype(phase_array.dtype, np.number):
            self.logger.warning(f"Non-numeric dtype detected: {phase_array.dtype}. Converting to np.float32.")
            try:
                phase_array = phase_array.astype(np.float32)
            except Exception as e:
                self.logger.error(f"Failed conversion to float32: {e}")
                raise

        try:
            if self.time_series_sequence_mode == "dtw":
                distortion = abs(current_length - target_length) / target_length
                self.logger.debug(f"[Distortion Analysis] Phase '{phase_name}': raw length {current_length} vs target {target_length} | Distortion: {distortion:.1%}")
                MAX_DISTORTION = 0.2
                if distortion > MAX_DISTORTION:
                    raise ValueError(f"Excessive DTW distortion {distortion:.1%} exceeds threshold of {MAX_DISTORTION:.0%}")
                alignment_path = dtw_path(phase_array, phase_array)
                aligned_seq = warp_sequence(phase_array, alignment_path, target_length)
            else:
                aligned_seq = self.pad_sequence(phase_array, target_length)
        except Exception as e:
            self.logger.warning(f"Alignment failed for phase '{phase_name}': {e}. Falling back to padding.")
            aligned_seq = self.pad_sequence(phase_array, target_length)

        if aligned_seq.shape[0] != target_length:
            self.logger.error(f"Phase '{phase_name}' alignment resulted in {aligned_seq.shape[0]} steps (expected {target_length}).")
            raise ValueError(f"Alignment for phase '{phase_name}' did not yield exactly {target_length} steps.")

        self.logger.debug(f"Phase '{phase_name}' aligned successfully to shape {aligned_seq.shape}")
        return aligned_seq



    def create_sequences(self, X: np.ndarray, y: np.ndarray) -> Tuple[Any, Any]:
        X_seq, y_seq = [], []
        if self.time_series_sequence_mode == "set_window":
            # Sliding window approach
            for i in range(0, len(X) - self.window_size - self.horizon + 1, self.step_size):
                seq_X = X[i:i+self.window_size]
                seq_y = y[i+self.window_size:i+self.window_size+self.horizon]
                if self.max_sequence_length and seq_X.shape[0] < self.max_sequence_length:
                    pad_width = self.max_sequence_length - seq_X.shape[0]
                    seq_X = np.pad(seq_X, ((0, pad_width), (0, 0)), mode='constant', constant_values=0)
                X_seq.append(seq_X)
                y_seq.append(seq_y)
        
        elif self.time_series_sequence_mode in ["dtw", "pad", "variable_length"]:
            # Full sequence processing
            X_seq = X
            y_seq = y
        
        else:
            raise ValueError(f"Unsupported time_series_sequence_mode: {self.time_series_sequence_mode}")

        return np.array(X_seq), np.array(y_seq)


    def split_dataset(
        self,
        X: pd.DataFrame,
        y: Optional[pd.Series] = None,
        split_ratio: float = 0.2,
        time_split_column: Optional[str] = None,
        time_split_value: Optional[Any] = None
    ) -> Tuple[pd.DataFrame, Optional[pd.DataFrame], Optional[pd.Series], Optional[pd.Series]]:
        """
        Split the dataset into training and testing sets while retaining original indices.
        Supports both ratio-based splitting and time-based splitting.

        Args:
            X (pd.DataFrame): Features.
            y (Optional[pd.Series]): Target variable.
            split_ratio (float): Proportion of the dataset to include in the test split (default: 0.2).
            time_split_column (Optional[str]): Column name for time-based splitting.
            time_split_value (Optional[Any]): Value to split on in the time_split_column.

        Returns:
            Tuple[pd.DataFrame, Optional[pd.DataFrame], Optional[pd.Series], Optional[pd.Series]]: X_train, X_test, y_train, y_test
        """
        step_name = "split_dataset"
        self.logger.info("Step: Split Dataset into Train and Test")

        # Debugging Statements
        self._log(f"Before Split - X shape: {X.shape}", step_name, 'debug')
        if y is not None:
            self._log(f"Before Split - y shape: {y.shape}", step_name, 'debug')
        else:
            self._log("Before Split - y is None", step_name, 'debug')

        # Determine splitting based on mode
        if self.mode == 'train' and self.model_category in ['classification', 'regression']:
            # Check if time-based splitting is specified
            if time_split_column and time_split_value:
                if time_split_column not in X.columns:
                    raise ValueError(f"Time split column '{time_split_column}' not found in the dataset.")
                
                # Perform time-based split
                self._log(f"Performing time-based split on column '{time_split_column}' with value {time_split_value}", step_name, 'debug')
                train_mask = X[time_split_column] <= time_split_value
                test_mask = X[time_split_column] > time_split_value
                
                X_train = X[train_mask].copy()
                X_test = X[test_mask].copy()
                
                if y is not None:
                    y_train = y[train_mask].copy()
                    y_test = y[test_mask].copy()
                else:
                    y_train = None
                    y_test = None
                    
                self._log("Performed time-based split.", step_name, 'debug')
            else:
                # Perform ratio-based split
                if self.model_category == 'classification':
                    stratify = y if self.options.get('split_dataset', {}).get('stratify_for_classification', False) else None
                    random_state = self.options.get('split_dataset', {}).get('random_state', 42)
                    X_train, X_test, y_train, y_test = train_test_split(
                        X, y, 
                        test_size=split_ratio,
                        stratify=stratify, 
                        random_state=random_state
                    )
                    self._log(f"Performed stratified split for classification with ratio {split_ratio}.", step_name, 'debug')
                elif self.model_category == 'regression':
                    X_train, X_test, y_train, y_test = train_test_split(
                        X, y, 
                        test_size=split_ratio,
                        random_state=self.options.get('split_dataset', {}).get('random_state', 42)
                    )
                    self._log(f"Performed random split for regression with ratio {split_ratio}.", step_name, 'debug')
        else:
            # For 'predict' and 'clustering' modes or other categories
            X_train = X.copy()
            X_test = None
            y_train = y.copy() if y is not None else None
            y_test = None
            self.logger.info(f"No splitting performed for mode '{self.mode}' or model category '{self.model_category}'.")

        self.preprocessing_steps.append("Split Dataset into Train and Test")

        # Keep Indices Aligned Through Each Step
        if X_test is not None and y_test is not None:
            # Sort both X_test and y_test by index
            X_test = X_test.sort_index()
            y_test = y_test.sort_index()
            self.logger.debug("Sorted X_test and y_test by index for alignment.")

        # Debugging: Log post-split shapes and index alignment
        self._log(f"After Split - X_train shape: {X_train.shape}, X_test shape: {X_test.shape if X_test is not None else 'N/A'}", step_name, 'debug')
        if self.model_category == 'classification' and y_train is not None and y_test is not None:
            self.logger.debug(f"Class distribution in y_train:\n{y_train.value_counts(normalize=True)}")
            self.logger.debug(f"Class distribution in y_test:\n{y_test.value_counts(normalize=True)}")
        elif self.model_category == 'regression' and y_train is not None and y_test is not None:
            self.logger.debug(f"y_train statistics:\n{y_train.describe()}")
            self.logger.debug(f"y_test statistics:\n{y_test.describe()}")

        # Check index alignment
        if y_train is not None and X_train.index.equals(y_train.index):
            self.logger.debug("X_train and y_train indices are aligned.")
        else:
            self.logger.warning("X_train and y_train indices are misaligned.")

        if X_test is not None and y_test is not None and X_test.index.equals(y_test.index):
            self.logger.debug("X_test and y_test indices are aligned.")
        elif X_test is not None and y_test is not None:
            self.logger.warning("X_test and y_test indices are misaligned.")

        return X_train, X_test, y_train, y_test


    def handle_missing_values(self, X_train: pd.DataFrame, X_test: Optional[pd.DataFrame] = None) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Handle missing values for numerical and categorical features based on user options.
        """
        step_name = "handle_missing_values"
        self.logger.info("Step: Handle Missing Values")

        # Fetch user-defined imputation options or set defaults
        impute_options = self.options.get('handle_missing_values', {})
        numerical_strategy = impute_options.get('numerical_strategy', {})
        categorical_strategy = impute_options.get('categorical_strategy', {})

        # Numerical Imputation
        numerical_imputer = None
        new_columns = []
        if self.numericals:
            if self.model_category in ['regression', 'classification', 'clustering']:
                default_num_strategy = 'median'  # Changed to median as per preprocessor_config_baseball.yaml
            else:
                default_num_strategy = 'median'
            num_strategy = numerical_strategy.get('strategy', default_num_strategy)
            num_imputer_type = numerical_strategy.get('imputer', 'SimpleImputer')  # Can be 'SimpleImputer', 'KNNImputer', etc.

            self._log(f"Numerical Imputation Strategy: {num_strategy.capitalize()}, Imputer Type: {num_imputer_type}", step_name, 'debug')

            # Initialize numerical imputer based on user option
            if num_imputer_type == 'SimpleImputer':
                numerical_imputer = SimpleImputer(strategy=num_strategy)
            elif num_imputer_type == 'KNNImputer':
                knn_neighbors = numerical_strategy.get('knn_neighbors', 5)
                numerical_imputer = KNNImputer(n_neighbors=knn_neighbors)
            else:
                self.logger.error(f"Numerical imputer type '{num_imputer_type}' is not supported.")
                raise ValueError(f"Numerical imputer type '{num_imputer_type}' is not supported.")

            # Fit and transform ONLY on X_train
            X_train[self.numericals] = numerical_imputer.fit_transform(X_train[self.numericals])
            self.numerical_imputer = numerical_imputer  # Assign to self for saving
            self.feature_reasons.update({col: self.feature_reasons.get(col, '') + f'Numerical: {num_strategy.capitalize()} Imputation | ' for col in self.numericals})
            new_columns.extend(self.numericals)

            if X_test is not None:
                # Transform ONLY on X_test without fitting
                X_test[self.numericals] = numerical_imputer.transform(X_test[self.numericals])

        # Categorical Imputation
        categorical_imputer = None
        all_categoricals = self.ordinal_categoricals + self.nominal_categoricals
        if all_categoricals:
            default_cat_strategy = 'most_frequent'
            cat_strategy = categorical_strategy.get('strategy', default_cat_strategy)
            cat_imputer_type = categorical_strategy.get('imputer', 'SimpleImputer')

            self._log(f"Categorical Imputation Strategy: {cat_strategy.capitalize()}, Imputer Type: {cat_imputer_type}", step_name, 'debug')

            # Initialize categorical imputer based on user option
            if cat_imputer_type == 'SimpleImputer':
                categorical_imputer = SimpleImputer(strategy=cat_strategy)
            elif cat_imputer_type == 'ConstantImputer':
                fill_value = categorical_strategy.get('fill_value', 'Missing')
                categorical_imputer = SimpleImputer(strategy='constant', fill_value=fill_value)
            else:
                self.logger.error(f"Categorical imputer type '{cat_imputer_type}' is not supported.")
                raise ValueError(f"Categorical imputer type '{cat_imputer_type}' is not supported.")

            # Fit and transform ONLY on X_train
            X_train[all_categoricals] = categorical_imputer.fit_transform(X_train[all_categoricals])
            self.categorical_imputer = categorical_imputer  # Assign to self for saving
            self.feature_reasons.update({
                col: self.feature_reasons.get(col, '') + (f'Categorical: Constant Imputation (Value={categorical_strategy.get("fill_value", "Missing")}) | ' if cat_imputer_type == 'ConstantImputer' else f'Categorical: {cat_strategy.capitalize()} Imputation | ')
                for col in all_categoricals
            })
            new_columns.extend(all_categoricals)

            if X_test is not None:
                # Transform ONLY on X_test without fitting
                X_test[all_categoricals] = categorical_imputer.transform(X_test[all_categoricals])

        self.preprocessing_steps.append("Handle Missing Values")

        # Debugging: Log post-imputation shapes and missing values
        self._log(f"Completed: Handle Missing Values. Dataset shape after imputation: {X_train.shape}", step_name, 'debug')
        self._log(f"Missing values after imputation in X_train:\n{X_train.isnull().sum()}", step_name, 'debug')
        self._log(f"New columns handled: {new_columns}", step_name, 'debug')

        return X_train, X_test

    def handle_outliers(self, X_train: pd.DataFrame, y_train: Optional[pd.Series] = None) -> Tuple[pd.DataFrame, Optional[pd.Series]]:
        """
        Handle outliers based on the model's sensitivity and user options.
        For time_series models, apply a custom outlier handling using a rolling statistic (median or mean)
        to replace extreme values rather than dropping rows (to preserve temporal alignment).

        Args:
            X_train (pd.DataFrame): Training features.
            y_train (pd.Series, optional): Training target.

        Returns:
            tuple: X_train with outliers handled and corresponding y_train.
        """
        step_name = "handle_outliers"
        self.logger.info("Step: Handle Outliers")
        self._log("Starting outlier handling.", step_name, 'debug')
        initial_shape = X_train.shape[0]
        outlier_options = self.options.get('handle_outliers', {})
        zscore_threshold = outlier_options.get('zscore_threshold', 3)
        iqr_multiplier = outlier_options.get('iqr_multiplier', 1.5)
        isolation_contamination = outlier_options.get('isolation_contamination', 0.05)

        # ----- NEW: Configurable outlier handling branch for time series -----
        if self.model_category == 'time_series':
            # Check if time series outlier handling is disabled
            if self.ts_outlier_method == 'none':
                self.logger.info("Time series outlier handling disabled per config")
                return X_train, y_train

            # Validate that the method is one of the allowed options
            valid_methods = ['median', 'mean']
            if self.ts_outlier_method not in valid_methods:
                raise ValueError(f"Invalid ts_outlier_method: {self.ts_outlier_method}. Choose from {valid_methods + ['none']}")

            self.logger.info(f"Applying {self.ts_outlier_method}-based outlier replacement for time series")
            
            # Process each numerical column using the selected method
            for col in self.numericals:
                # Dynamic method selection based on configuration
                if self.ts_outlier_method == 'median':
                    rolling_stat = X_train[col].rolling(window=5, center=True, min_periods=1).median()
                elif self.ts_outlier_method == 'mean':
                    rolling_stat = X_train[col].rolling(window=5, center=True, min_periods=1).mean()
                
                # Compute rolling IQR for outlier detection
                rolling_q1 = X_train[col].rolling(window=5, center=True, min_periods=1).quantile(0.25)
                rolling_q3 = X_train[col].rolling(window=5, center=True, min_periods=1).quantile(0.75)
                rolling_iqr = rolling_q3 - rolling_q1
                
                # Create an outlier mask based on deviation from the rolling statistic
                outlier_mask = abs(X_train[col] - rolling_stat) > (iqr_multiplier * rolling_iqr)
                
                # Replace detected outliers with the corresponding rolling statistic
                X_train.loc[outlier_mask, col] = rolling_stat[outlier_mask]
                self.logger.debug(f"Replaced {outlier_mask.sum()} outliers in column '{col}' using {self.ts_outlier_method} method.")
            
            self.preprocessing_steps.append("Handle Outliers (time_series custom)")
            self._log(f"Completed: Handle Outliers for time_series. Initial samples: {initial_shape}, Final samples: {X_train.shape[0]}", step_name, 'debug')
            return X_train, y_train
        # -----------------------------------------------------------------

        # Existing outlier handling for regression and classification models
        if self.model_category in ['regression', 'classification']:
            self.logger.info(f"Applying univariate outlier detection for {self.model_category}.")
            for col in self.numericals:
                # Z-Score Filtering
                apply_zscore = outlier_options.get('apply_zscore', True)
                if apply_zscore:
                    z_scores = np.abs((X_train[col] - X_train[col].mean()) / X_train[col].std())
                    mask_z = z_scores < zscore_threshold
                    removed_z = (~mask_z).sum()
                    X_train = X_train[mask_z]
                    if y_train is not None:
                        y_train = y_train.loc[X_train.index]
                    self.feature_reasons[col] += f'Outliers handled with Z-Score Filtering (threshold={zscore_threshold}) | '
                    self._log(f"Removed {removed_z} outliers from '{col}' using Z-Score Filtering.", step_name, 'debug')

                # IQR Filtering
                apply_iqr = outlier_options.get('apply_iqr', True)
                if apply_iqr:
                    Q1 = X_train[col].quantile(0.25)
                    Q3 = X_train[col].quantile(0.75)
                    IQR = Q3 - Q1
                    lower_bound = Q1 - iqr_multiplier * IQR
                    upper_bound = Q3 + iqr_multiplier * IQR
                    mask_iqr = (X_train[col] >= lower_bound) & (X_train[col] <= upper_bound)
                    removed_iqr = (~mask_iqr).sum()
                    X_train = X_train[mask_iqr]
                    if y_train is not None:
                        y_train = y_train.loc[X_train.index]
                    self.feature_reasons[col] += f'Outliers handled with IQR Filtering (multiplier={iqr_multiplier}) | '
                    self._log(f"Removed {removed_iqr} outliers from '{col}' using IQR Filtering.", step_name, 'debug')

        elif self.model_category == 'clustering':
            self.logger.info("Applying multivariate IsolationForest for clustering.")
            contamination = isolation_contamination
            iso_forest = IsolationForest(contamination=contamination, random_state=42)
            preds = iso_forest.fit_predict(X_train[self.numericals])
            mask_iso = preds != -1
            removed_iso = (preds == -1).sum()
            X_train = X_train[mask_iso]
            if y_train is not None:
                y_train = y_train.loc[X_train.index]
            self.feature_reasons['all_numericals'] += f'Outliers handled with Multivariate IsolationForest (contamination={contamination}) | '
            self._log(f"Removed {removed_iso} outliers using Multivariate IsolationForest.", step_name, 'debug')
        else:
            self.logger.warning(f"Model category '{self.model_category}' not recognized for outlier handling.")

        self.preprocessing_steps.append("Handle Outliers")
        self._log(f"Completed: Handle Outliers. Initial samples: {initial_shape}, Final samples: {X_train.shape[0]}", step_name, 'debug')
        self._log(f"Missing values after outlier handling in X_train:\n{X_train.isnull().sum()}", step_name, 'debug')
        return X_train, y_train



    def test_normality(self, X_train: pd.DataFrame) -> Dict[str, Dict]:
        """
        Test normality for numerical features based on normality tests and user options.

        Args:
            X_train (pd.DataFrame): Training features.

        Returns:
            Dict[str, Dict]: Dictionary with normality test results for each numerical feature.
        """
        step_name = "Test for Normality"
        self.logger.info(f"Step: {step_name}")
        debug_flag = self.get_debug_flag('debug_test_normality')
        normality_results = {}

        # Fetch user-defined normality test options or set defaults
        normality_options = self.options.get('test_normality', {})
        p_value_threshold = normality_options.get('p_value_threshold', 0.05)
        skewness_threshold = normality_options.get('skewness_threshold', 1.0)
        additional_tests = normality_options.get('additional_tests', [])  # e.g., ['anderson-darling']

        for col in self.numericals:
            data = X_train[col].dropna()
            skewness = data.skew()
            kurtosis = data.kurtosis()

            # Determine which normality test to use based on sample size and user options
            test_used = 'Shapiro-Wilk'
            p_value = 0.0

            if len(data) <= 5000:
                from scipy.stats import shapiro
                stat, p_val = shapiro(data)
                test_used = 'Shapiro-Wilk'
                p_value = p_val
            else:
                from scipy.stats import anderson
                result = anderson(data)
                test_used = 'Anderson-Darling'
                # Determine p-value based on critical values
                p_value = 0.0  # Default to 0
                for cv, sig in zip(result.critical_values, result.significance_level):
                    if result.statistic < cv:
                        p_value = sig / 100
                        break

            # Apply user-defined or default criteria
            if self.model_category in ['regression', 'classification', 'clustering']:
                # Linear, Logistic Regression, and Clustering: Use p-value and skewness
                needs_transform = (p_value < p_value_threshold) or (abs(skewness) > skewness_threshold)
            else:
                # Other models: Use skewness, and optionally p-values based on options
                use_p_value = normality_options.get('use_p_value_other_models', False)
                if use_p_value:
                    needs_transform = (p_value < p_value_threshold) or (abs(skewness) > skewness_threshold)
                else:
                    needs_transform = abs(skewness) > skewness_threshold

            normality_results[col] = {
                'skewness': skewness,
                'kurtosis': kurtosis,
                'p_value': p_value,
                'test_used': test_used,
                'needs_transform': needs_transform
            }

            # Conditional Detailed Logging
            if debug_flag:
                self._log(f"Feature '{col}': p-value={p_value:.4f}, skewness={skewness:.4f}, needs_transform={needs_transform}", step_name, 'debug')

        self.normality_results = normality_results
        self.preprocessing_steps.append(step_name)

        # Completion Logging
        if debug_flag:
            self._log(f"Completed: {step_name}. Normality results computed.", step_name, 'debug')
        else:
            self.logger.info(f"Step '{step_name}' completed: Normality results computed.")

        return normality_results


    def generate_recommendations(self) -> pd.DataFrame:
        """
        Generate a table of preprocessing recommendations based on the model type, data, and user options.

        Returns:
            pd.DataFrame: DataFrame containing recommendations for each feature.
        """
        step_name = "Generate Preprocessor Recommendations"
        self.logger.info(f"Step: {step_name}")
        debug_flag = self.get_debug_flag('debug_generate_recommendations')

        # Generate recommendations based on feature reasons
        recommendations = {}
        for col in self.ordinal_categoricals + self.nominal_categoricals + self.numericals:
            reasons = self.feature_reasons.get(col, '').strip(' | ')
            recommendations[col] = reasons

        recommendations_table = pd.DataFrame.from_dict(
            recommendations, 
            orient='index', 
            columns=['Preprocessing Reason']
        )
        if debug_flag:
            self.logger.debug(f"Preprocessing Recommendations:\n{recommendations_table}")
        else:
            self.logger.info("Preprocessing Recommendations generated.")

        self.preprocessing_steps.append(step_name)

        # Completion Logging
        if debug_flag:
            self._log(f"Completed: {step_name}. Recommendations generated.", step_name, 'debug')
        else:
            self.logger.info(f"Step '{step_name}' completed: Recommendations generated.")

        return recommendations_table

    def save_transformers(self, model_input_shape=None):
        """
        Save all transformers and relevant model parameters for later use in prediction.
        
        Args:
            model_input_shape (tuple, optional): The expected input shape for prediction models
                                            (batch_size, seq_len, features)
        """
        step_name = "Save Transformers"
        self.logger.info(f"Step: {step_name}")
        
        # Ensure the transformers directory exists
        os.makedirs(self.transformers_dir, exist_ok=True)
        transformers_path = os.path.join(self.transformers_dir, 'transformers.pkl')
        
        # Basic transformers to save
        transformers = {
            'numerical_imputer': getattr(self, 'numerical_imputer', None),
            'categorical_imputer': getattr(self, 'categorical_imputer', None),
            'preprocessor': self.pipeline,
            'smote': self.smote,
            'final_feature_order': self.final_feature_order,
            'categorical_indices': self.categorical_indices,
            'model_category': self.model_category,
            'expected_model_shape': model_input_shape,  # NEW: Store the expected model shape
            'actual_output_shape': getattr(self, 'actual_output_shape', None)  # NEW: Store the actual output shape
        }
        
        # Add time series specific components if applicable
        if self.model_category == 'time_series':
            time_series_components = {
                'time_series_sequence_mode': self.time_series_sequence_mode,
                'window_size': getattr(self, 'window_size', None),
                'step_size': getattr(self, 'step_size', None),
                'horizon': getattr(self, 'horizon', None),
                'max_sequence_length': getattr(self, 'max_sequence_length', None),
                'global_target_lengths': getattr(self, 'global_target_lengths', {}),
                'sequence_categorical': self.sequence_categorical,
                'sequence_dtw_or_pad_categorical': self.sequence_dtw_or_pad_categorical
            }
            transformers.update(time_series_components)
        
        try:
            joblib.dump(transformers, transformers_path)
            self.logger.info(f"Transformers saved at '{transformers_path}'.")
        except Exception as e:
            self.logger.error(f"❌ Failed to save transformers: {e}")
            raise


    def load_transformers(self) -> dict:
        """
        Load transformers and model parameters from saved files.
        """
        step_name = "Load Transformers"
        self.logger.info(f"Step: {step_name}")
        
        transformers_path = os.path.join(self.transformers_dir, 'transformers.pkl')
        
        if not os.path.exists(transformers_path):
            self.logger.error(f"❌ Transformers file not found at '{transformers_path}'.")
            raise FileNotFoundError(f"Transformers file not found at '{transformers_path}'.")
        
        try:
            transformers = joblib.load(transformers_path)
            
            # Load basic transformers
            self.numerical_imputer = transformers.get('numerical_imputer')
            self.categorical_imputer = transformers.get('categorical_imputer')
            self.pipeline = transformers.get('preprocessor')
            self.smote = transformers.get('smote')
            self.final_feature_order = transformers.get('final_feature_order', [])
            self.categorical_indices = transformers.get('categorical_indices', [])
            
            # NEW: Load model shape information
            self.expected_model_shape = transformers.get('expected_model_shape')
            self.actual_output_shape = transformers.get('actual_output_shape')
            
            # Load time series specific components if applicable
            if transformers.get('model_category') == 'time_series':
                self.time_series_sequence_mode = transformers.get('time_series_sequence_mode')
                self.window_size = transformers.get('window_size')
                self.step_size = transformers.get('step_size')
                self.horizon = transformers.get('horizon')
                self.max_sequence_length = transformers.get('max_sequence_length')
                self.global_target_lengths = transformers.get('global_target_lengths', {})
                
                # Log loaded components
                self.logger.debug(f"Loaded time series parameters: mode={self.time_series_sequence_mode}, "
                                f"window_size={self.window_size}, step_size={self.step_size}, horizon={self.horizon}")
                
                # NEW: Log shape information if available
                if self.expected_model_shape:
                    self.logger.debug(f"Loaded expected model input shape: {self.expected_model_shape}")
                if self.actual_output_shape:
                    self.logger.debug(f"Loaded actual preprocessor output shape: {self.actual_output_shape}")
            
            self.logger.info(f"Transformers loaded successfully from '{transformers_path}'.")
            return transformers
        
        except Exception as e:
            self.logger.error(f"❌ Failed to load transformers: {e}")
            raise



    def determine_n_neighbors(self, minority_count: int, default_neighbors: int = 5) -> int:
        """
        Determine the appropriate number of neighbors for SMOTE based on minority class size.

        Args:
            minority_count (int): Number of samples in the minority class.
            default_neighbors (int): Default number of neighbors to use if possible.

        Returns:
            int: Determined number of neighbors for SMOTE.
        """
        if minority_count <= 1:
            raise ValueError("SMOTE cannot be applied when the minority class has less than 2 samples.")
        
        # Ensure n_neighbors does not exceed minority_count - 1
        n_neighbors = min(default_neighbors, minority_count - 1)
        return n_neighbors

    def implement_smote(self, X_train: pd.DataFrame, y_train: pd.Series) -> Tuple[pd.DataFrame, pd.Series]:
        """
        Implement SMOTE or its variants based on class imbalance with automated n_neighbors selection.

        Args:
            X_train (pd.DataFrame): Training features (transformed).
            y_train (pd.Series): Training target.

        Returns:
            Tuple[pd.DataFrame, pd.Series]: Resampled X_train and y_train.
        """
        step_name = "Implement SMOTE (Train Only)"
        self.logger.info(f"Step: {step_name}")

        # Check if classification
        if self.model_category != 'classification':
            self.logger.info("SMOTE not applicable: Not a classification model.")
            self.preprocessing_steps.append("SMOTE Skipped")
            return X_train, y_train

        # Calculate class distribution
        class_counts = y_train.value_counts()
        if len(class_counts) < 2:
            self.logger.warning("SMOTE not applicable: Only one class present.")
            self.preprocessing_steps.append("SMOTE Skipped")
            return X_train, y_train

        majority_class = class_counts.idxmax()
        minority_class = class_counts.idxmin()
        majority_count = class_counts.max()
        minority_count = class_counts.min()
        imbalance_ratio = minority_count / majority_count
        self.logger.info(f"Class Distribution before SMOTE: {class_counts.to_dict()}")
        self.logger.info(f"Imbalance Ratio (Minority/Majority): {imbalance_ratio:.4f}")

        # Determine SMOTE variant based on dataset composition
        has_numericals = len(self.numericals) > 0
        has_categoricals = len(self.ordinal_categoricals) + len(self.nominal_categoricals) > 0

        # Automatically select SMOTE variant
        if has_numericals and has_categoricals:
            smote_variant = 'SMOTENC'
            self.logger.info("Dataset contains both numerical and categorical features. Using SMOTENC.")
        elif has_numericals and not has_categoricals:
            smote_variant = 'SMOTE'
            self.logger.info("Dataset contains only numerical features. Using SMOTE.")
        elif has_categoricals and not has_numericals:
            smote_variant = 'SMOTEN'
            self.logger.info("Dataset contains only categorical features. Using SMOTEN.")
        else:
            smote_variant = 'SMOTE'  # Fallback
            self.logger.info("Feature composition unclear. Using SMOTE as default.")

        # Initialize SMOTE based on the variant
        try:
            if smote_variant == 'SMOTENC':
                if not self.categorical_indices:
                    # Determine categorical indices if not already set
                    categorical_features = []
                    for name, transformer, features in self.pipeline.transformers_:
                        if 'ord' in name or 'nominal' in name:
                            if isinstance(transformer, Pipeline):
                                encoder = transformer.named_steps.get('ordinal_encoder') or transformer.named_steps.get('onehot_encoder')
                                if hasattr(encoder, 'categories_'):
                                    # Calculate indices based on transformers order
                                    # This can be complex; for simplicity, assuming categorical features are the first
                                    categorical_features.extend(range(len(features)))
                    self.categorical_indices = categorical_features
                    self.logger.debug(f"Categorical feature indices for SMOTENC: {self.categorical_indices}")
                n_neighbors = self.determine_n_neighbors(minority_count, default_neighbors=5)
                smote = SMOTENC(categorical_features=self.categorical_indices, random_state=42, k_neighbors=n_neighbors)
                self.logger.debug(f"Initialized SMOTENC with categorical features indices: {self.categorical_indices} and n_neighbors={n_neighbors}")
            elif smote_variant == 'SMOTEN':
                n_neighbors = self.determine_n_neighbors(minority_count, default_neighbors=5)
                smote = SMOTEN(random_state=42, n_neighbors=n_neighbors)
                self.logger.debug(f"Initialized SMOTEN with n_neighbors={n_neighbors}")
            else:
                n_neighbors = self.determine_n_neighbors(minority_count, default_neighbors=5)
                smote = SMOTE(random_state=42, k_neighbors=n_neighbors)
                self.logger.debug(f"Initialized SMOTE with n_neighbors={n_neighbors}")
        except ValueError as ve:
            self.logger.error(f"❌ SMOTE initialization failed: {ve}")
            raise
        except Exception as e:
            self.logger.error(f"❌ Unexpected error during SMOTE initialization: {e}")
            raise

        # Apply SMOTE
        try:
            X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
            self.logger.info(f"Applied {smote_variant}. Resampled dataset shape: {X_resampled.shape}")
            self.preprocessing_steps.append("Implement SMOTE")
            self.smote = smote  # Assign to self for saving
            self.logger.debug(f"Selected n_neighbors for SMOTE: {n_neighbors}")
            return X_resampled, y_resampled
        except Exception as e:
            self.logger.error(f"❌ SMOTE application failed: {e}")
            raise

    def inverse_transform_data(self, X_transformed: np.ndarray, original_data: Optional[pd.DataFrame] = None) -> pd.DataFrame:
        """
        Perform inverse transformation on the transformed data to reconstruct original feature values.

        Args:
            X_transformed (np.ndarray): The transformed feature data.
            original_data (Optional[pd.DataFrame]): The original data before transformation.

        Returns:
            pd.DataFrame: The inverse-transformed DataFrame including passthrough columns.
        """
        if self.pipeline is None:
            self.logger.error("Preprocessing pipeline has not been fitted. Cannot perform inverse transformation.")
            raise AttributeError("Preprocessing pipeline has not been fitted. Cannot perform inverse transformation.")

        preprocessor = self.pipeline
        logger = logging.getLogger('InverseTransform')
        if self.debug or self.get_debug_flag('debug_final_inverse_transformations'):
            logger.setLevel(logging.DEBUG)
        else:
            logger.setLevel(logging.INFO)

        logger.debug(f"[DEBUG Inverse] Starting inverse transformation. Input shape: {X_transformed.shape}")

        # Initialize variables
        inverse_data = {}
        transformations_applied = False  # Flag to check if any transformations are applied
        start_idx = 0  # Starting index for slicing

        # Iterate over each transformer in the ColumnTransformer
        for name, transformer, features in preprocessor.transformers_:
            if name == 'remainder':
                logger.debug(f"[DEBUG Inverse] Skipping 'remainder' transformer (passthrough columns).")
                continue  # Skip passthrough columns

            end_idx = start_idx + len(features)
            logger.debug(f"[DEBUG Inverse] Transformer '{name}' handling features {features} with slice {start_idx}:{end_idx}")

            # Check if the transformer has an inverse_transform method
            if hasattr(transformer, 'named_steps'):
                # Access the last step in the pipeline (e.g., scaler or encoder)
                last_step = list(transformer.named_steps.keys())[-1]
                inverse_transformer = transformer.named_steps[last_step]

                if hasattr(inverse_transformer, 'inverse_transform'):
                    transformed_slice = X_transformed[:, start_idx:end_idx]
                    inverse_slice = inverse_transformer.inverse_transform(transformed_slice)

                    # Assign inverse-transformed data to the corresponding feature names
                    for idx, feature in enumerate(features):
                        inverse_data[feature] = inverse_slice[:, idx]

                    logger.debug(f"[DEBUG Inverse] Applied inverse_transform on transformer '{last_step}' for features {features}.")
                    transformations_applied = True
                else:
                    logger.debug(f"[DEBUG Inverse] Transformer '{last_step}' does not support inverse_transform. Skipping.")
            else:
                logger.debug(f"[DEBUG Inverse] Transformer '{name}' does not have 'named_steps'. Skipping.")

            start_idx = end_idx  # Update starting index for next transformer

        # Convert the inverse_data dictionary to a DataFrame
        if transformations_applied:
            inverse_df = pd.DataFrame(inverse_data, index=original_data.index if original_data is not None else None)
            logger.debug(f"[DEBUG Inverse] Inverse DataFrame shape (transformed columns): {inverse_df.shape}")
            logger.debug(f"[DEBUG Inverse] Sample of inverse-transformed data:\n{inverse_df.head()}")
        else:
            if original_data is not None:
                logger.warning("⚠️ No reversible transformations were applied. Returning original data.")
                inverse_df = original_data.copy()
                logger.debug(f"[DEBUG Inverse] Returning a copy of original_data with shape: {inverse_df.shape}")
            else:
                logger.error("❌ No transformations were applied and original_data was not provided. Cannot perform inverse transformation.")
                raise ValueError("No transformations were applied and original_data was not provided.")

        # Identify passthrough columns by excluding transformed features
        if original_data is not None and transformations_applied:
            transformed_features = set(inverse_data.keys())
            all_original_features = set(original_data.columns)
            passthrough_columns = list(all_original_features - transformed_features)
            logger.debug(f"[DEBUG Inverse] Inverse DataFrame columns before pass-through merge: {inverse_df.columns.tolist()}")
            logger.debug(f"[DEBUG Inverse] all_original_features: {list(all_original_features)}")
            logger.debug(f"[DEBUG Inverse] passthrough_columns: {passthrough_columns}")

            if passthrough_columns:
                logger.debug(f"[DEBUG Inverse] Passthrough columns to merge: {passthrough_columns}")
                passthrough_data = original_data[passthrough_columns].copy()
                inverse_df = pd.concat([inverse_df, passthrough_data], axis=1)

                # Ensure the final DataFrame has the same column order as original_data
                inverse_df = inverse_df[original_data.columns]
                logger.debug(f"[DEBUG Inverse] Final inverse DataFrame shape: {inverse_df.shape}")
                
                # Check for missing columns after inverse transform
                expected_columns = set(original_data.columns)
                final_columns = set(inverse_df.columns)
                missing_after_inverse = expected_columns - final_columns

                if missing_after_inverse:
                    err_msg = (
                    f"Inverse transform error: The following columns are missing "
                    f"after inverse transform: {missing_after_inverse}"
                    )
                    logger.error(err_msg)
                    raise ValueError(err_msg)
            else:
                logger.debug("[DEBUG Inverse] No passthrough columns to merge.")
        else:
            logger.debug("[DEBUG Inverse] Either no original_data provided or no transformations were applied.")

        return inverse_df



    def build_pipeline(self, X_train: pd.DataFrame) -> ColumnTransformer:
        transformers = []

        # Handle Numerical Features
        if self.numericals:
            numerical_strategy = self.options.get('handle_missing_values', {}).get('numerical_strategy', {}).get('strategy', 'median')
            numerical_imputer = self.options.get('handle_missing_values', {}).get('numerical_strategy', {}).get('imputer', 'SimpleImputer')

            if numerical_imputer == 'SimpleImputer':
                num_imputer = SimpleImputer(strategy=numerical_strategy)
            elif numerical_imputer == 'KNNImputer':
                knn_neighbors = self.options.get('handle_missing_values', {}).get('numerical_strategy', {}).get('knn_neighbors', 5)
                num_imputer = KNNImputer(n_neighbors=knn_neighbors)
            else:
                raise ValueError(f"Unsupported numerical imputer type: {numerical_imputer}")

            # Determine scaling method
            scaling_method = self.options.get('apply_scaling', {}).get('method', None)
            if scaling_method is None:
                # Default scaling based on model category
                if self.model_category in ['regression', 'classification', 'clustering']:
                    # For clustering, MinMaxScaler is generally preferred
                    if self.model_category == 'clustering':
                        scaler = MinMaxScaler()
                        scaling_type = 'MinMaxScaler'
                    else:
                        scaler = StandardScaler()
                        scaling_type = 'StandardScaler'
                else:
                    scaler = 'passthrough'
                    scaling_type = 'None'
            else:
                # Normalize the scaling_method string to handle case-insensitivity
                scaling_method_normalized = scaling_method.lower()
                if scaling_method_normalized == 'standardscaler':
                    scaler = StandardScaler()
                    scaling_type = 'StandardScaler'
                elif scaling_method_normalized == 'minmaxscaler':
                    scaler = MinMaxScaler()
                    scaling_type = 'MinMaxScaler'
                elif scaling_method_normalized == 'robustscaler':
                    scaler = RobustScaler()
                    scaling_type = 'RobustScaler'
                elif scaling_method_normalized == 'none':
                    scaler = 'passthrough'
                    scaling_type = 'None'
                else:
                    raise ValueError(f"Unsupported scaling method: {scaling_method}")

            numerical_transformer = Pipeline(steps=[
                ('imputer', num_imputer),
                ('scaler', scaler)
            ])

            transformers.append(('num', numerical_transformer, self.numericals))
            self.logger.debug(f"Numerical transformer added with imputer '{numerical_imputer}' and scaler '{scaling_type}'.")

        # Handle Ordinal Categorical Features
        if self.ordinal_categoricals:
            ordinal_strategy = self.options.get('encode_categoricals', {}).get('ordinal_encoding', 'OrdinalEncoder')
            if ordinal_strategy == 'OrdinalEncoder':
                ordinal_transformer = Pipeline(steps=[
                    ('imputer', SimpleImputer(strategy='most_frequent')),
                    # Updated OrdinalEncoder to handle unknown categories gracefully
                    ('ordinal_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
                ])
                transformers.append(('ord', ordinal_transformer, self.ordinal_categoricals))
                self.logger.debug("Ordinal transformer added with OrdinalEncoder (handling unknown categories with -1).")
            else:
                raise ValueError(f"Unsupported ordinal encoding strategy: {ordinal_strategy}")

        # Handle Nominal Categorical Features
        if self.nominal_categoricals:
            nominal_strategy = self.options.get('encode_categoricals', {}).get('nominal_encoding', 'OneHotEncoder')
            if nominal_strategy == 'OneHotEncoder':
                nominal_transformer = Pipeline(steps=[
                    ('imputer', SimpleImputer(strategy='most_frequent')),
                    ('onehot_encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
                ])
                transformers.append(('nominal', nominal_transformer, self.nominal_categoricals))
                self.logger.debug("Nominal transformer added with OneHotEncoder.")
            elif nominal_strategy == 'OrdinalEncoder':
                nominal_transformer = Pipeline(steps=[
                    ('imputer', SimpleImputer(strategy='most_frequent')),
                    ('ordinal_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
                ])
                transformers.append(('nominal_ord', nominal_transformer, self.nominal_categoricals))
                self.logger.debug("Nominal transformer added with OrdinalEncoder.")
            elif nominal_strategy == 'FrequencyEncoder':
                # Implement custom Frequency Encoding
                for feature in self.nominal_categoricals:
                    freq = X_train[feature].value_counts(normalize=True)
                    X_train[feature] = X_train[feature].map(freq)
                    self.feature_reasons[feature] += 'Frequency Encoding applied | '
                    self.logger.debug(f"Frequency Encoding applied to '{feature}'.")
            else:
                raise ValueError(f"Unsupported nominal encoding strategy: {nominal_strategy}")

        if not transformers and 'FrequencyEncoder' not in nominal_strategy:
            self.logger.error("No transformers added to the pipeline. Check feature categorization and configuration.")
            raise ValueError("No transformers added to the pipeline. Check feature categorization and configuration.")

        preprocessor = ColumnTransformer(transformers=transformers, remainder='passthrough')
        self.logger.debug("ColumnTransformer constructed with the following transformers:")
        for t in transformers:
            self.logger.debug(t)

        preprocessor.fit(X_train)
        self.logger.info("✅ Preprocessor fitted on training data.")

        # Determine categorical feature indices for SMOTENC if needed
        if self.options.get('implement_smote', {}).get('variant', None) == 'SMOTENC':
            if not self.categorical_indices:
                categorical_features = []
                for name, transformer, features in preprocessor.transformers_:
                    if 'ord' in name or 'nominal' in name:
                        if isinstance(transformer, Pipeline):
                            encoder = transformer.named_steps.get('ordinal_encoder') or transformer.named_steps.get('onehot_encoder')
                            if hasattr(encoder, 'categories_'):
                                # Calculate indices based on transformers order
                                # This can be complex; for simplicity, assuming categorical features are the first
                                categorical_features.extend(range(len(features)))
                self.categorical_indices = categorical_features
                self.logger.debug(f"Categorical feature indices for SMOTENC: {self.categorical_indices}")

        return preprocessor





    def check_target_alignment(self, X_seq: Any, y_seq: Any) -> bool:
        """
        Verify that for each sequence the target length matches expectations.
        For 'set_window' mode, the target should have self.horizon rows;
        for other modes, it is assumed the target length should match the sequence length.
        """
        for idx, (seq, target) in enumerate(zip(X_seq, y_seq)):
            seq_length = seq.shape[0] if hasattr(seq, 'shape') else len(seq)
            # For set_window mode, we expect the target to have self.horizon rows.
            # For other modes, we expect the target length to match the sequence length.
            if self.time_series_sequence_mode == "set_window":
                expected_length = self.horizon
            else:
                expected_length = seq_length

            actual_length = target.shape[0] if hasattr(target, 'shape') else len(target)
            self.logger.debug(
                f"Sequence {idx}: sequence length = {seq_length}, expected target length = {expected_length}, actual target length = {actual_length}"
            )
            if actual_length != expected_length:
                self.logger.error(
                    f"Alignment error in sequence {idx}: expected target length {expected_length} but got {actual_length}"
                )
                return False
        return True




    def get_phase_order(self) -> List[str]:
        """Return phase order with mode-specific behavior."""
        predefined_order = ["windup", "arm_cocking", "arm_acceleration", "follow_through"]
        
        # For prediction mode, be more flexible: use only phases actually detected in the data.
        if self.mode == 'predict':
            if hasattr(self, 'global_target_lengths') and self.global_target_lengths:
                detected_phases = list(self.global_target_lengths.keys())
                self.logger.info(f"Using detected phases for prediction: {detected_phases}")
                return sorted(detected_phases)
        
        # Otherwise, if we have global target lengths from alignment, sort them with predefined order first.
        if hasattr(self, 'global_target_lengths') and self.global_target_lengths:
            detected_phases = list(self.global_target_lengths.keys())
            ordered = sorted(
                detected_phases,
                key=lambda x: (predefined_order.index(x) if x in predefined_order else len(predefined_order), x)
            )
            return ordered
        
        # Fallback: use the secondary grouping configuration if available.
        return self.sequence_dtw_or_pad_categorical or []



    def reassemble_phases(self, aligned_phases: Dict) -> Tuple[Dict, Dict]:
        """
        Concatenate the aligned phase arrays for each group along the temporal axis.
        This function:
        - Checks that all phases in the expected order (from get_phase_order) are present.
        - If any are missing, it raises an error with details.
        - Orders the phases and concatenates them along axis=0.
        - Records metadata (individual phase lengths, total features).
        
        Args:
            aligned_phases (dict): Dictionary mapping group keys to dictionaries of aligned phase arrays.
        
        Returns:
            Tuple[Dict, Dict]: A dictionary of final sequences and metadata.
        """
        phase_order = self.get_phase_order()
        final_seqs = {}
        metadata = {}
        for group_key, phases in aligned_phases.items():
            # Check if all expected phases are present.
            missing = set(phase_order) - set(phases.keys())
            if missing:
                self.logger.error(f"Group {group_key} is missing phases: {missing}")
                raise ValueError(f"Missing phases {missing} in group {group_key}")
            
            # Order the phases accordingly.
            ordered_phases = [phases[name] for name in phase_order]
            # Concatenate along time axis (axis=0)
            full_seq = np.concatenate(ordered_phases, axis=0)
            metadata[group_key] = {
                "phase_lengths": [arr.shape[0] for arr in ordered_phases],
                "total_features": full_seq.shape[1]
            }
            final_seqs[group_key] = full_seq
            self.logger.debug(
                f"Group {group_key} reassembled: shape {full_seq.shape} "
                f"(Phase lengths: {metadata[group_key]['phase_lengths']})"
            )
        return final_seqs, metadata


    def validate_temporal_integrity(self, final_seqs: Dict, metadata: Dict):
        """
        For each group, verify that the concatenated sequence length equals the sum of individual phase lengths.
        Raises a ValueError if a mismatch is found.
        """
        for group_key, seq in final_seqs.items():
            expected_length = sum(metadata[group_key]["phase_lengths"])
            if seq.shape[0] != expected_length:
                raise ValueError(
                    f"Group {group_key}: Expected length {expected_length}, got {seq.shape[0]}. "
                    f"Phase lengths: {metadata[group_key]['phase_lengths']}"
                )


    def validate_feature_space(self, final_seqs: Dict):
        """
        Ensure that all final sequences have the same number of features.
        """
        if not final_seqs:
            self.logger.warning("No sequences to validate. Skipping feature space validation.")
            return
        
        base_features = next(iter(final_seqs.values())).shape[1]
        for group_key, seq in final_seqs.items():
            if seq.shape[1] != base_features:
                raise ValueError(
                    f"Group {group_key}: Feature dimension mismatch. Expected {base_features}, got {seq.shape[1]}"
                )



    def log_phase_lengths(self, aligned_phases: Dict):
        """
        Logs the dimensions of each phase for each group for debugging purposes.
        """
        for group_key, phases in aligned_phases.items():
            self.logger.debug(f"\nGroup {group_key} phase dimensions:")
            for pname, parr in phases.items():
                self.logger.debug(f"  {pname}: {parr.shape}")
            # Assuming all phases have the same feature dimension:
            any_phase = next(iter(phases.values()))
            self.logger.debug(f"Total features (from a phase): {any_phase[1].shape[1] if isinstance(any_phase, tuple) else any_phase.shape[1]}")


    def sanity_check_concatenation(self, input_phases: List[np.ndarray], output_seq: np.ndarray):
        """
        Perform a sanity check by comparing sample values between the input phases and output sequence.
        Verifies that the very first value of the first phase and the last value of the last phase are preserved.
        """
        phase1_start = input_phases[0][0, 0]
        phaseN_end = input_phases[-1][-1, -1]
        if not np.isclose(output_seq[0, 0], phase1_start):
            raise AssertionError("Start value mismatch in concatenated sequence")
        if not np.isclose(output_seq[-1, -1], phaseN_end):
            raise AssertionError("End value mismatch in concatenated sequence")
        self.logger.debug("Sanity check passed for concatenation.")


    def full_reassembly_pipeline(self, aligned_phases: Dict) -> Tuple[np.ndarray, np.ndarray]:
        """
        Executes the complete reassembly pipeline:
        1. Logs input phase dimensions.
        2. Reassembles phases with reassemble_phases().
        3. Validates temporal integrity and feature consistency.
        4. Performs a sanity check on one sample group.
        5. Returns the final sequences and corresponding group labels.
        
        Args:
            aligned_phases (dict): Dictionary mapping group keys to aligned phase data.
        
        Returns:
            Tuple[np.ndarray, np.ndarray]: Final sequences array and array of group keys.
        """
        self.logger.debug("Starting full reassembly pipeline.")
        # Log the input phase dimensions
        self.logger.debug("Input phase dimensions:")
        self.log_phase_lengths(aligned_phases)
        
        # Reassemble phases
        final_seqs_dict, metadata = self.reassemble_phases(aligned_phases)
        
        # Run validations
        self.validate_temporal_integrity(final_seqs_dict, metadata)
        self.validate_feature_space(final_seqs_dict)

        # Perform a sanity check on one sample group
        sample_group = next(iter(aligned_phases.keys()))
        sample_phases = [aligned_phases[sample_group][p] for p in self.get_phase_order()]
        self.sanity_check_concatenation(sample_phases, final_seqs_dict[sample_group])
        
        # Convert the final sequences to arrays
        group_keys = list(final_seqs_dict.keys())
        sequences = np.array([final_seqs_dict[gk] for gk in group_keys])
        return sequences, np.array(group_keys)

    def apply_psi_feature_selection(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Apply PSI-based feature selection using feature-engine's DropHighPSIFeatures.
        
        Args:
            data (pd.DataFrame): Input data with features and datetime column
            
        Returns:
            pd.DataFrame: Data with high PSI features dropped
        """
        # Get PSI configuration from options
        psi_config = self.options.get('psi_feature_selection', {})
        if not psi_config.get('enabled', False):
            self.logger.info("PSI-based feature selection is disabled. Skipping.")
            return data
        
        # Extract configuration parameters
        psi_threshold = psi_config.get('threshold', 0.25)
        split_frac = psi_config.get('split_frac', 0.75)
        split_distinct = psi_config.get('split_distinct', False)
        cut_off = psi_config.get('cut_off', None)
        
        # Check if we have a time column (required for chronological splitting)
        if self.time_column is None or self.time_column not in data.columns:
            self.logger.warning("No time column specified or found. Cannot perform PSI-based feature selection.")
            return data
        
        # Determine which features to analyze
        features_to_analyze = [col for col in data.columns 
                            if col not in self.y_variable and col != self.time_column]
        
        try:
            # Initialize the PSI transformer
            psi_transformer = DropHighPSIFeatures(
                variables=features_to_analyze,
                split_col=self.time_column,
                split_frac=split_frac,
                split_distinct=split_distinct,
                threshold=psi_threshold,
                cut_off=cut_off
            )
            
            # Fit and transform to drop high PSI features
            data_reduced = psi_transformer.fit_transform(data)
            
            # Log the results
            dropped_features = set(features_to_analyze) - set(data_reduced.columns)
            if dropped_features:
                self.logger.info(f"Dropped {len(dropped_features)} features with high PSI values: {dropped_features}")
                
                # Update feature reasons dictionary
                for feature in dropped_features:
                    if feature in self.feature_reasons:
                        self.feature_reasons[feature] += f"Dropped due to high PSI value (threshold={psi_threshold}) | "
                        
                # Store PSI values for reporting
                self.psi_values = psi_transformer.psi_values_
            
            return data_reduced
        
        except Exception as e:
            self.logger.error(f"Error during PSI-based feature selection: {e}")
            self.logger.warning("Continuing without PSI-based feature selection.")
            return data



    def split_time_series(self, data: pd.DataFrame, test_size: float = 0.2, random_state: int = 42) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """
        Split time series data respecting temporal ordering, with chronological split.
        
        Args:
            data (pd.DataFrame): Time series data to split
            test_size (float): Proportion of data to use for testing (0.0 to 1.0)
            random_state (int): Random seed for reproducibility (unused in chronological splitting)
            
        Returns:
            Tuple[pd.DataFrame, pd.DataFrame]: Training data, testing data
        """
        # Check if we have a time column
        if self.time_column is None or self.time_column not in data.columns:
            self.logger.warning("No time column specified or found. Using index-based splitting.")
            split_idx = int((1 - test_size) * len(data))
            return data.iloc[:split_idx], data.iloc[split_idx:]
        
        # Sort by time
        data_sorted = data.sort_values(by=self.time_column)
        
        # Use chronological splitting
        split_idx = int((1 - test_size) * len(data_sorted))
        self.logger.info(f"Performing chronological split at index {split_idx} (test_size={test_size})")
        return data_sorted.iloc[:split_idx], data_sorted.iloc[split_idx:]


    def split_with_feature_engine(self, data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """
        Split time series data using feature-engine's built-in splitting functionality.
        
        Args:
            data (pd.DataFrame): Time series data to split
            
        Returns:
            Tuple[pd.DataFrame, pd.DataFrame]: Reference data (train), Test data (test)
        """
        # Get configuration for feature-engine splitting
        fe_config = self.options.get('feature_engine_split', {})
        
        # Extract configuration parameters
        split_frac = fe_config.get('split_frac', 0.75)
        split_distinct = fe_config.get('split_distinct', False)
        cut_off = fe_config.get('cut_off', None)
        
        # Check if we have a time column (required for chronological splitting)
        if self.time_column is None or self.time_column not in data.columns:
            self.logger.warning("No time column specified or found. Cannot use feature-engine splitting.")
            return self.split_time_series(data)
        
        try:
            # Create a temporary transformer just for splitting
            # Setting threshold=1.0 means no features will be dropped based on PSI
            splitter = DropHighPSIFeatures(
                variables=[],  # Empty list means no PSI calculation
                split_col=self.time_column,
                split_frac=split_frac,
                split_distinct=split_distinct,
                threshold=1.0,  # Set high to prevent dropping features
                cut_off=cut_off
            )
            
            # Extract reference and test sets without transforming the data
            splitter.fit(data)
            
            # Access the reference and test sets directly from the transformer
            reference_set = splitter.reference_set_.copy()
            test_set = splitter.test_set_.copy()
            
            self.logger.info(f"Split data using feature-engine: reference_set shape={reference_set.shape}, test_set shape={test_set.shape}")
            
            return reference_set, test_set
        
        except Exception as e:
            self.logger.error(f"Error during feature-engine splitting: {e}")
            self.logger.warning("Falling back to standard time series splitting.")
            return self.split_time_series(data)

    def process_set_window(self, data: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
        """
        Process data using the set_window (sliding window) approach.
        
        Args:
            data (pd.DataFrame): Data to process
            
        Returns:
            Tuple[np.ndarray, np.ndarray]: X sequences, y sequences
        """
        # Split features and target
        X = data.drop(columns=self.y_variable)
        y = data[self.y_variable]
        
        # Debug: Log unique values in each categorical column (both ordinal and nominal)
        self.logger.debug(f"process_set_window called with data shape: {data.shape}")
        for col in self.ordinal_categoricals + self.nominal_categoricals:
            if col in X.columns:
                unique_vals = sorted(X[col].unique())
                self.logger.debug(f"Column '{col}' unique values: {unique_vals}")
        #columns during build_pipeline:
        if self.ordinal_categoricals:
            self.logger.debug(f"Building ordinal encoding for columns: {self.ordinal_categoricals}")
            for idx, col in enumerate(self.ordinal_categoricals):
                self.logger.debug(f"Column index {idx}: {col}")

        # Build and apply preprocessing pipeline
        if not hasattr(self, 'pipeline') or self.pipeline is None:
            self.pipeline = self.build_pipeline(X)
            self.logger.debug("Created new pipeline and fitting on data.")

            # -----------------------------------------------------------------
            # ENHANCED DEBUG LOGGING
            # -----------------------------------------------------------------
            if hasattr(self, 'pipeline') and self.pipeline is not None:
                # The pipeline is either newly built or existing
                # We check each transformer in the pipeline to see if it has an ordinal encoder
                for name, transformer, columns in self.pipeline.transformers_:
                    if 'ord' in name:  # or however you identify the ordinal transformer
                        if hasattr(transformer, 'named_steps') and 'ordinal_encoder' in transformer.named_steps:
                            encoder = transformer.named_steps['ordinal_encoder']
                            if hasattr(encoder, 'categories_'):
                                self.logger.debug("Trained categories for ordinal columns:")
                                for i, col in enumerate(columns):
                                    if i < len(encoder.categories_):
                                        self.logger.debug(
                                            f"Column '{col}' - known categories: {encoder.categories_[i]}"
                                        )

    
            X_preprocessed = self.pipeline.fit_transform(X)
        else:
            self.logger.debug("Using existing pipeline for transformation.")
            try:
                X_preprocessed = self.pipeline.transform(X)
            except ValueError as e:
                # Enhanced error handling for unknown categories
                if "unknown categories" in str(e):
                    # Attempt to extract the column index from the error message
                    # Example message: "Found unknown categories [...] in column 0..."
                    # We'll parse the column index from the error string if possible.
                    try:
                        col_index_str = str(e).split("column ")[1].split(" ")[0]
                        col_index = int(col_index_str)
                        column_name = (self.ordinal_categoricals[col_index]
                                    if col_index < len(self.ordinal_categoricals)
                                    else f"unknown({col_index})")
                        self.logger.error(f"Unknown category error in column '{column_name}': {str(e)}")
                    except Exception:
                        # If parsing fails, just log the original error
                        self.logger.error(f"Unknown category error: {str(e)}")
                else:
                    self.logger.error(f"Pipeline transformation error: {str(e)}")
                raise


        X_seq, y_seq = [], []
        
        # Create sliding windows
        for i in range(0, len(X_preprocessed) - self.window_size - self.horizon + 1, self.step_size):
            seq_X = X_preprocessed[i:i+self.window_size]
            seq_y = y.iloc[i+self.window_size:i+self.window_size+self.horizon].values
            
            # Apply sequence length constraints if specified
            if self.max_sequence_length and seq_X.shape[0] < self.max_sequence_length:
                pad_width = self.max_sequence_length - seq_X.shape[0]
                seq_X = np.pad(seq_X, ((0, pad_width), (0, 0)), mode='constant', constant_values=0)
                    
            X_seq.append(seq_X)
            y_seq.append(seq_y)
        
        self.logger.debug(f"Created {len(X_seq)} sequences using set_window mode.")
        
        return np.array(X_seq), np.array(y_seq)



    def process_dtw_or_pad(self, data: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
        """
        Process data using DTW or pad alignment.
        Groups the data by top-level sequence columns, computes global target lengths,
        aligns subphases, and then reassembles them.
        In prediction mode, missing phases are replaced with zero–filled placeholders.
        """
        if not self.sequence_categorical:
            raise ValueError("DTW/pad mode requires sequence_categorical to be specified")
        
        # Split features and target.
        X = data.drop(columns=self.y_variable)
        y = data[self.y_variable]
        
        # Build and apply preprocessing pipeline.
        if not hasattr(self, 'pipeline') or self.pipeline is None:
            self.pipeline = self.build_pipeline(X)
            X_preprocessed = self.pipeline.fit_transform(X)
        else:
            X_preprocessed = self.pipeline.transform(X)
        
        # Group the data.
        grouped = self._group_top_level(data)
        
        # First Pass: Compute global target lengths.
        if not hasattr(self, 'global_target_lengths') or not self.global_target_lengths:
            self.global_target_lengths = {}
        global_target_lengths = self.global_target_lengths.copy()
        
        for group_key, group_data in grouped:
            # Extra debug logging for the problematic group.
            if group_key == (2757.0, 26.0):
                if 'pitch_phase_biomech' in group_data.columns:
                    raw_phases = group_data['pitch_phase_biomech'].unique()
                    self.logger.debug("=== DETAILED ANALYSIS OF PROBLEM GROUP (2757.0, 26.0) ===")
                    self.logger.debug(f"Raw phases present: {raw_phases}")
                    for phase in raw_phases:
                        norm = self.normalize_phase_key(str(phase))
                        self.logger.debug(f"Phase normalization: '{phase}' → '{norm}'")
                    for phase in raw_phases:
                        count = len(group_data[group_data['pitch_phase_biomech'] == phase])
                        self.logger.debug(f"Phase '{phase}' has {count} samples")
                else:
                    self.logger.debug("No 'pitch_phase_biomech' column found in problem group!")
                    self.logger.debug(f"Available columns: {group_data.columns.tolist()}")
            
            subphases = self._segment_subphases(group_data, skip_min_samples=True)
            for phase, (phase_name, phase_array) in subphases.items():
                current_len = phase_array.shape[0]
                global_target_lengths[phase] = max(global_target_lengths.get(phase, 0), current_len)
        
        # Update global target lengths.
        self.global_target_lengths = global_target_lengths
        self.logger.debug(f"Global target lengths per phase: {global_target_lengths}")
        
        self.logger.debug(f"All detected phases: {set(global_target_lengths.keys())}")
        self.logger.debug(f"Expected phases: {set(self.get_phase_order())}")
        self.logger.debug(f"Missing globally: {set(self.get_phase_order()) - set(global_target_lengths.keys())}")
                
        # Second Pass: Align each subphase using the global targets.
        aligned_groups = {}
        for group_key, group_data in grouped:
            subphases = self._segment_subphases(group_data)
            aligned_subphases = {}
            for phase, (phase_name, phase_array) in subphases.items():
                target = global_target_lengths.get(phase, phase_array.shape[0])
                try:
                    aligned = self._align_phase(phase_array, target, phase_name=phase_name)
                    aligned_subphases[phase] = aligned
                except Exception as e:
                    self.logger.error(f"Alignment failed for group {group_key}, phase {phase}: {e}")
                    aligned_subphases[phase] = None
            
            # Check if all expected phases are present.
            missing = set(self.get_phase_order()) - set(aligned_subphases.keys())
            if missing:
                if self.mode == 'predict':
                    self.logger.warning(f"Group {group_key} missing phases: {missing}, attempting to continue anyway")
                    # Create placeholder arrays for each missing phase.
                    for missing_phase in missing:
                        if aligned_subphases:
                            sample_phase = next((val for val in aligned_subphases.values() if val is not None), None)
                            if sample_phase is None:
                                self.logger.error("Cannot create placeholder: No valid phase found in group.")
                                continue
                            feature_dim = sample_phase.shape[1]
                            target_len = global_target_lengths.get(missing_phase, 10)  # Default length if not available.
                            placeholder = np.zeros((target_len, feature_dim))
                            aligned_subphases[missing_phase] = placeholder
                            self.logger.info(f"Created placeholder for missing '{missing_phase}' with shape {placeholder.shape}")
                        else:
                            self.logger.error("Cannot create placeholder: No valid phases available in group.")
                    still_missing = set(self.get_phase_order()) - set(aligned_subphases.keys())
                    if still_missing:
                        self.logger.error(f"Group {group_key} still missing phases after placeholder creation: {still_missing}")
                        self.logger.warning(f"Skipping group {group_key} due to incomplete phase placeholders.")
                    else:
                        aligned_groups[group_key] = aligned_subphases
                else:
                    self.logger.error(f"Group {group_key} invalid. Missing phases: {missing}")
                    self.logger.warning(f"Skipping group {group_key} due to missing phases.")
            elif all(aligned is not None for aligned in aligned_subphases.values()):
                aligned_groups[group_key] = aligned_subphases
            else:
                self.logger.warning(f"Skipping invalid group {group_key}")
        
        # Reassemble phases.
        X_seq, group_labels = self.full_reassembly_pipeline(aligned_groups)
        
        # Extract targets for each group.
        y_seq = []
        for group_key in aligned_groups.keys():
            if isinstance(self.sequence_categorical, list):
                mask = data[self.sequence_categorical].apply(lambda row: tuple(row) == group_key, axis=1)
            else:
                mask = data[self.sequence_categorical] == group_key
            
            group_indices = data[mask].index
            y_group = y.loc[group_indices].values
            
            if len(y_group) < self.horizon:
                self.logger.error(f"Group {group_key} has insufficient target samples for horizon {self.horizon}")
                raise ValueError(f"Insufficient target samples in group {group_key}")
            
            if self.horizon == 1:
                y_seq.append(y_group[-1])
            else:
                y_seq.append(y_group[-self.horizon:].squeeze())
        
        y_seq = np.array(y_seq)
        
        return X_seq, y_seq




    def process_variable_length(self, data: pd.DataFrame) -> Tuple[np.ndarray, np.ndarray]:
        """
        Process data using the variable length approach.
        
        Args:
            data (pd.DataFrame): Data to process
            
        Returns:
            Tuple[np.ndarray, np.ndarray]: X sequences, y sequences
        """
        if not hasattr(self, 'window_size') or self.window_size is None:
            raise ValueError("variable_length mode requires window_size to be specified")
        
        # Split features and target
        X = data.drop(columns=self.y_variable)
        y = data[self.y_variable]
        
        # Build and apply preprocessing pipeline
        if not hasattr(self, 'pipeline') or self.pipeline is None:
            self.pipeline = self.build_pipeline(X)
            X_preprocessed = self.pipeline.fit_transform(X)
        else:
            X_preprocessed = self.pipeline.transform(X)
        
        X_seq, y_seq = [], []
        
        # Create sequences with variable lengths
        for i in range(self.window_size, len(X_preprocessed) - self.horizon + 1):
            # Use all available data up to the current point (variable length)
            seq_X = X_preprocessed[i-self.window_size:i]
            seq_y = y.iloc[i:i+self.horizon].values
            
            X_seq.append(seq_X)
            y_seq.append(seq_y)
        
        self.logger.debug(f"Created {len(X_seq)} sequences using variable_length mode")
        
        return np.array(X_seq), np.array(y_seq)

    
    def _dtw_distance(self, s1, s2, window=None):
        """Calculate DTW distance between two sequences with optional window constraint."""
        n, m = len(s1), len(s2)
        dtw = np.full((n+1, m+1), np.inf)
        dtw[0, 0] = 0
        
        for i in range(1, n+1):
            window_start = max(1, i - window) if window else 1
            window_end = min(m + 1, i + window + 1) if window else m + 1
            
            for j in range(window_start, window_end):
                cost = np.linalg.norm(s1[i-1] - s2[j-1])
                dtw[i, j] = cost + min(dtw[i-1, j], dtw[i, j-1], dtw[i-1, j-1])
        
        return dtw[n, m]
    
    def fit_resample(self, X, y):
        """Generate synthetic samples along the temporal path between samples."""
        X_array = np.array(X)
        y_array = np.array(y)
        
        # Find minority and majority classes
        unique_classes, counts = np.unique(y_array, return_counts=True)
        minority_class = unique_classes[np.argmin(counts)]
        majority_class = unique_classes[np.argmax(counts)]
        
        # Get minority class samples
        minority_indices = np.where(y_array == minority_class)[0]
        
        # Calculate how many synthetic samples to generate
        n_minority = np.sum(y_array == minority_class)
        n_majority = np.sum(y_array == majority_class)
        n_to_generate = n_majority - n_minority
        
        if n_to_generate <= 0 or len(minority_indices) < 2:
            return X_array, y_array
            
        # Find temporal neighbors for each minority sample
        temporal_neighbors = self._find_temporal_neighbors(X_array, minority_indices)
        
        # Generate synthetic samples
        synthetic_samples = []
        synthetic_labels = []
        
        for _ in range(n_to_generate):
            # Randomly select a minority sample
            idx = self.rng.choice(minority_indices)
            
            # Get its temporal neighbors
            neighbors = temporal_neighbors.get(idx, [])
            
            if not neighbors:
                continue
                
            # Select a random temporal neighbor
            neighbor_idx = self.rng.choice(neighbors)
            
            # Generate a weight for interpolation
            alpha = self.rng.random()
            
            # Special handling for different sequence modes
            if self.phase_markers is not None:
                # Respect phase boundaries when generating synthetic samples
                synthetic = self._generate_phase_aware_sample(X_array[idx], X_array[neighbor_idx], alpha)
            elif self.window_size is not None:
                # For set_window mode, preserve window structure
                synthetic = self._generate_window_aware_sample(X_array[idx], X_array[neighbor_idx], alpha)
            else:
                # Generate synthetic sample along the temporal path
                synthetic = X_array[idx] + alpha * (X_array[neighbor_idx] - X_array[idx])
            
            synthetic_samples.append(synthetic)
            synthetic_labels.append(minority_class)
        
        # Combine original and synthetic samples
        if synthetic_samples:
            X_resampled = np.vstack([X_array, np.array(synthetic_samples)])
            y_resampled = np.hstack([y_array, np.array(synthetic_labels)])
            return X_resampled, y_resampled
        else:
            return X_array, y_array
            
    def _generate_phase_aware_sample(self, sample1, sample2, alpha):
        """Generate a synthetic sample respecting phase boundaries."""
        synthetic = np.zeros_like(sample1)
        
        # If phase markers are provided, respect them when generating samples
        if self.phase_markers:
            phase_start = 0
            for i, marker in enumerate(self.phase_markers):
                phase_end = marker if i < len(self.phase_markers) else len(sample1)
                
                # Interpolate within each phase separately
                phase_slice = slice(phase_start, phase_end)
                synthetic[phase_slice] = sample1[phase_slice] + alpha * (sample2[phase_slice] - sample1[phase_slice])
                
                phase_start = phase_end
        else:
            # If no phase markers, use standard interpolation
            synthetic = sample1 + alpha * (sample2 - sample1)
            
        return synthetic
        
    def _generate_window_aware_sample(self, sample1, sample2, alpha):
        """Generate a synthetic sample preserving window structure."""
        synthetic = np.zeros_like(sample1)
        
        # For windowed data, weight more recent time steps more heavily
        for t in range(len(sample1)):
            # Increasing weight for more recent time steps
            time_weight = 0.5 + 0.5 * (t / len(sample1))
            effective_alpha = alpha * time_weight
            
            # Interpolate with time-weighted alpha
            synthetic[t] = sample1[t] + effective_alpha * (sample2[t] - sample1[t])
        
        return synthetic


    def validate_categorical_distributions(self, train_data: pd.DataFrame, test_data: pd.DataFrame) -> None:
        self.logger.info("Validating categorical distributions...")
        for col in self.ordinal_categoricals:
            if col in train_data.columns and col in test_data.columns:
                self.logger.debug(f"Column {col} dtypes - train: {train_data[col].dtype}, test: {test_data[col].dtype}")
                self.logger.debug(f"Sample values - train: {train_data[col].head(3).tolist()}, test: {test_data[col].head(3).tolist()}")

        for idx, col in enumerate(self.ordinal_categoricals):
            if col in train_data.columns and col in test_data.columns:
                train_cats = set(train_data[col].unique())
                test_cats = set(test_data[col].unique())
                
                # Check for test categories not in training
                unseen_cats = test_cats - train_cats
                if unseen_cats:
                    self.logger.warning(f"Column index {idx} ({col}) has unseen categories: {sorted(unseen_cats)}")
                    self.logger.warning(f"Training categories for {col}: {sorted(train_cats)}")
                    
                    # Calculate percentage of test rows affected
                    affected_rows = test_data[test_data[col].isin(unseen_cats)].shape[0]
                    affected_pct = round(100 * affected_rows / test_data.shape[0], 2)
                    self.logger.warning(f"Affects {affected_rows} test rows ({affected_pct}% of test data)")


    def split_dataset_sequence_aware(self, data: pd.DataFrame, split_date: pd.Timestamp) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """
        Split the time series data into training and testing sets while preserving whole sequences.
        Sequences are defined by the 'sequence_categorical' columns. This method groups the data,
        computes each group’s start and end times (using self.time_column), and then chooses a split
        boundary based on a provided split_date. All data belonging to groups before the boundary
        goes to training and all groups after go to testing.

        Args:
            data (pd.DataFrame): The input data (should include the time_column).
            split_date (pd.Timestamp): The desired split date.
        
        Returns:
            Tuple[pd.DataFrame, pd.DataFrame]: (train_data, test_data)
        """
        # Ensure the time_column is in datetime format
        if self.time_column not in data.columns:
            raise ValueError(f"Time column '{self.time_column}' not found in data.")
        data[self.time_column] = pd.to_datetime(data[self.time_column])

        # Group data by the sequence identifier(s)
        groups = self._group_top_level(data)  # returns list of (group_key, group_df)
        if not groups:
            raise ValueError("No groups found for sequence-aware splitting.")

        # Build a list of sequence boundaries: (group_key, min_time, max_time)
        sequence_boundaries = []
        for group_key, group_df in groups:
            start_time = group_df[self.time_column].min()
            end_time = group_df[self.time_column].max()
            sequence_boundaries.append((group_key, start_time, end_time))

        # Sort the groups by their start time
        sequence_boundaries.sort(key=lambda x: x[1])
        
        # Determine the split index by comparing the split_date with each group's boundaries.
        split_idx = 0
        min_distance = float('inf')
        for i, (group_key, start_time, end_time) in enumerate(sequence_boundaries):
            # Check if the split_date falls within the current group
            if start_time <= split_date <= end_time:
                # Decide based on which boundary is closer to the split_date
                start_distance = abs((split_date - start_time).total_seconds())
                end_distance = abs((end_time - split_date).total_seconds())
                split_idx = i if start_distance <= end_distance else i + 1
                break
            else:
                # Otherwise, track the closest group start
                distance = abs((split_date - start_time).total_seconds())
                if distance < min_distance:
                    min_distance = distance
                    split_idx = i
        
        # Create lists of group keys for training and testing
        train_group_keys = [entry[0] for entry in sequence_boundaries[:split_idx]]
        test_group_keys = [entry[0] for entry in sequence_boundaries[split_idx:]]
        
        # Create masks based on whether a row's sequence belongs to train or test groups.
        if isinstance(self.sequence_categorical, list) and len(self.sequence_categorical) > 1:
            # For multiple sequence columns, use tuple of values
            train_mask = data[self.sequence_categorical].apply(lambda row: tuple(row) in train_group_keys, axis=1)
            test_mask = data[self.sequence_categorical].apply(lambda row: tuple(row) in test_group_keys, axis=1)
        else:
            # If a single column is used (or if list with one element)
            key = self.sequence_categorical[0] if isinstance(self.sequence_categorical, list) else self.sequence_categorical
            train_mask = data[key].isin(train_group_keys)
            test_mask = data[key].isin(test_group_keys)

        train_data = data.loc[train_mask].copy()
        test_data = data.loc[test_mask].copy()

        self.logger.info(f"Sequence-aware split: Requested split at {split_date}.")
        if sequence_boundaries:
            # Log the actual boundary used (using the first group in the test set, if available)
            actual_split_time = sequence_boundaries[split_idx][1] if split_idx < len(sequence_boundaries) else sequence_boundaries[-1][2]
            self.logger.info(f"Actual split at nearest sequence boundary: {actual_split_time}")
            self.logger.info(f"Train sequences: {len(train_group_keys)}, Test sequences: {len(test_group_keys)}")
        return train_data, test_data

    def analyze_split_options(self, data: pd.DataFrame) -> list:
        """
        Analyze the available sequence boundaries and return candidate split options.
        For each possible split point (i.e. between sequences), compute the fraction of data
        that would fall in the training set.

        Args:
            data (pd.DataFrame): The input data (must contain the time_column).
        
        Returns:
            List[Dict]: A list of dictionaries with keys: 'split_time', 'train_fraction', and 'test_fraction'.
        """
        if self.time_column not in data.columns:
            raise ValueError(f"Time column '{self.time_column}' not found in data.")
        data[self.time_column] = pd.to_datetime(data[self.time_column])
        groups = self._group_top_level(data)
        if not groups:
            return []
        
        # Compute boundaries and group sizes
        sequence_boundaries = []
        total_samples = len(data)
        cumulative = 0
        options = []
        for group_key, group_df in groups:
            start_time = group_df[self.time_column].min()
            end_time = group_df[self.time_column].max()
            group_size = len(group_df)
            cumulative += group_size
            train_fraction = cumulative / total_samples
            options.append({
                "group_key": group_key,
                "split_time": start_time,
                "train_fraction": train_fraction,
                "test_fraction": 1 - train_fraction
            })
        # Sort options by how close the train_fraction is to any desired value if needed
        return options

    def preprocess_time_series(self, data: pd.DataFrame) -> Tuple[Any, Any, Any, Any, pd.DataFrame, Any]:
        """
        Preprocess data specifically for time series models.
        
        Steps:
        1. Handle missing values and outliers.
        2. Sort data by time.
           - NEW: Log all unique phase values in the dataset.
        3. Split data into training and testing sets.
           - If the time_series_split options specify method "sequence_aware", then:
             a. If a 'split_date' is provided, use it.
             b. Else, if a 'target_train_fraction' is provided, use analyze_split_options
                to determine an appropriate split_date.
             c. Otherwise, raise an error.
        4. Continue with the rest of the time series processing:
           - Grouping, segmentation, alignment, reassembly.
        
        Returns:
            Tuple: (X_seq, X_test_seq, y_seq, y_test_seq, recommendations, extras)
        """
        # 1. Handle missing values and outliers.
        data_clean, _ = self.handle_missing_values(data)
        X_temp = data_clean.drop(columns=self.y_variable)
        y_temp = data_clean[self.y_variable]
        X_temp, y_temp = self.handle_outliers(X_temp, y_temp)
        data_clean = pd.concat([X_temp, y_temp], axis=1)
        
        # 2. Sort data by time column.
        if self.time_column is None:
            raise ValueError("For time series models, 'time_column' must be specified.")
        data_clean[self.time_column] = pd.to_datetime(data_clean[self.time_column])
        data_sorted = data_clean.sort_values(by=self.time_column)
        
        # NEW: Log all unique phase values in the sorted dataset (if the column exists).
        if 'pitch_phase_biomech' in data_sorted.columns:
            all_unique_phases = data_sorted['pitch_phase_biomech'].unique()
            self.logger.debug(f"All unique phase values in dataset: {all_unique_phases}")
        
        # 3. Split the data.
        ts_split_options = self.options.get('time_series_split', {})
        if ts_split_options.get('method') == "sequence_aware":
            # NEW: If no split_date is provided, try to compute one using target_train_fraction.
            if 'split_date' not in ts_split_options:
                if 'target_train_fraction' in ts_split_options:
                    target_fraction = ts_split_options['target_train_fraction']
                    self.logger.info(f"Using target_train_fraction: {target_fraction} to determine split_date")
                    # Use analyze_split_options to obtain candidate split points.
                    split_opts = self.analyze_split_options(data)
                    if split_opts:
                        # Choose the option whose train_fraction is closest to the target.
                        closest_option = min(split_opts, key=lambda x: abs(x['train_fraction'] - target_fraction))
                        ts_split_options['split_date'] = closest_option['split_time']
                        self.logger.info(f"Determined split_date: {ts_split_options['split_date']}")
                    else:
                        self.logger.warning("No valid split options found. Falling back to chronological split.")
                        ts_split_options['method'] = 'chronological'
                else:
                    raise ValueError("For sequence_aware splitting, either 'split_date' or 'target_train_fraction' must be provided in options.")
            # Use the (provided or determined) split_date to split.
            split_date = pd.to_datetime(ts_split_options['split_date'])
            self.logger.info(f"Performing sequence-aware split at split_date: {split_date}")
            train_data, test_data = self.split_dataset_sequence_aware(data_sorted, split_date)
        else:
            # Default to chronological splitting.
            test_size = self.options.get('split_dataset', {}).get('test_size', 0.2)
            train_data, test_data = self.split_time_series(data_sorted, test_size=test_size, random_state=42)
        
        # 4. Continue with the rest of the processing.
        self.validate_categorical_distributions(train_data, test_data)
        
        # Remove the time column from feature lists to avoid passing it to the pipeline.
        if self.time_column in self.numericals:
            self.numericals.remove(self.time_column)
        if self.time_column in self.ordinal_categoricals:
            self.ordinal_categoricals.remove(self.time_column)
        if self.time_column in self.nominal_categoricals:
            self.nominal_categoricals.remove(self.time_column)
        
        X_train = train_data.drop(columns=self.y_variable)
        X_clean = X_train.drop(columns=[self.time_column]) if self.time_column in X_train.columns else X_train
        y_train = train_data[self.y_variable]
        y_clean = y_train
        
        self.pipeline = self.build_pipeline(X_clean)
        X_preprocessed = self.pipeline.fit_transform(X_clean)
        
        # Process training data to create sequences.
        if self.sequence_categorical is not None:
            grouped = self._group_top_level(train_data)
            global_target_lengths = {}
            for group_key, group_data in grouped:
                subphases = self._segment_subphases(group_data, skip_min_samples=True)
                for phase, (phase_name, phase_array) in subphases.items():
                    current_len = phase_array.shape[0]
                    global_target_lengths[phase] = max(global_target_lengths.get(phase, 0), current_len)
            self.logger.debug(f"Global target lengths per phase: {global_target_lengths}")
            self.global_target_lengths = global_target_lengths
            
            aligned_groups = {}
            for group_key, group_data in grouped:
                subphases = self._segment_subphases(group_data)
                aligned_subphases = {}
                for phase, (phase_name, phase_array) in subphases.items():
                    target = global_target_lengths.get(phase, phase_array.shape[0])
                    try:
                        aligned = self._align_phase(phase_array, target, phase_name=phase_name)
                        aligned_subphases[phase] = aligned
                    except Exception as e:
                        self.logger.error(f"Alignment failed for group {group_key}, phase {phase}: {e}")
                        aligned_subphases[phase] = None
                missing = set(self.get_phase_order()) - set(aligned_subphases.keys())
                if missing:
                    self.logger.error(f"Group {group_key} invalid. Missing phases: {missing}")
                    self.logger.warning(f"Skipping group {group_key} due to missing phases.")
                elif all(aligned is not None for aligned in aligned_subphases.values()):
                    aligned_groups[group_key] = aligned_subphases
                else:
                    self.logger.warning(f"Skipping invalid group {group_key}")
            
            X_seq, group_labels = self.full_reassembly_pipeline(aligned_groups)
            
            y_seq = []
            for group_key in aligned_groups.keys():
                if isinstance(self.sequence_categorical, list):
                    mask = train_data[self.sequence_categorical].apply(lambda row: tuple(row) == group_key, axis=1)
                else:
                    mask = train_data[self.sequence_categorical] == group_key
                group_indices = train_data[mask].index
                y_group = y_clean.loc[group_indices].values
                if len(y_group) < self.horizon:
                    self.logger.error(f"Group {group_key} has insufficient target samples for horizon {self.horizon}")
                    raise ValueError(f"Insufficient target samples in group {group_key}")
                if self.horizon == 1:
                    y_seq.append(y_group[-1])
                else:
                    y_seq.append(y_group[-self.horizon:].squeeze())
            y_seq = np.array(y_seq)
        elif self.time_series_sequence_mode == "set_window":
            X_seq, y_seq = self.create_sequences(X_preprocessed, y_clean.values)
        else:
            raise ValueError(f"Invalid time_series_sequence_mode: {self.time_series_sequence_mode}")
        
        if y_seq is not None and not self.check_target_alignment(X_seq, y_seq):
            self.logger.warning("Target alignment check failed: Some sequences may not have matching target lengths.")
        
        recommendations = self.generate_recommendations()
        self.final_feature_order = list(self.pipeline.get_feature_names_out())
        self.save_transformers()
        
        # Process test data using the already fitted pipeline.
        X_test = test_data.drop(columns=self.y_variable)
        y_test = test_data[self.y_variable]
        X_test_preprocessed = self.pipeline.transform(X_test)
        
        if self.sequence_categorical is not None:
            test_grouped = self._group_top_level(test_data)
            test_aligned_groups = {}
            for group_key, group_data in test_grouped:
                subphases = self._segment_subphases(group_data)
                aligned_subphases = {}
                for phase, (phase_name, phase_array) in subphases.items():
                    target = global_target_lengths.get(phase, phase_array.shape[0])
                    try:
                        aligned = self._align_phase(phase_array, target, phase_name=phase_name)
                        aligned_subphases[phase] = aligned
                    except Exception as e:
                        self.logger.error(f"Test alignment failed for group {group_key}, phase {phase}: {e}")
                        aligned_subphases[phase] = None
                missing = set(self.get_phase_order()) - set(aligned_subphases.keys())
                if missing:
                    self.logger.error(f"Test group {group_key} invalid. Missing phases: {missing}")
                    self.logger.warning(f"Skipping test group {group_key} due to missing phases.")
                elif all(aligned is not None for aligned in aligned_subphases.values()):
                    test_aligned_groups[group_key] = aligned_subphases
                else:
                    self.logger.warning(f"Skipping invalid test group {group_key}")
            
            if test_aligned_groups:
                X_test_seq, test_group_labels = self.full_reassembly_pipeline(test_aligned_groups)
                y_test_seq = []
                for group_key in test_aligned_groups.keys():
                    if isinstance(self.sequence_categorical, list):
                        mask = test_data[self.sequence_categorical].apply(lambda row: tuple(row) == group_key, axis=1)
                    else:
                        mask = test_data[self.sequence_categorical] == group_key
                    group_indices = test_data[mask].index
                    y_group = y_test.loc[group_indices].values
                    if len(y_group) < self.horizon:
                        self.logger.error(f"Test group {group_key} has insufficient target samples for horizon {self.horizon}")
                        raise ValueError(f"Insufficient target samples in test group {group_key}")
                    if self.horizon == 1:
                        y_test_seq.append(y_group[-1])
                    else:
                        y_test_seq.append(y_group[-self.horizon:].squeeze())
                y_test_seq = np.array(y_test_seq)
            else:
                X_test_seq = None
                y_test_seq = None
        elif self.time_series_sequence_mode == "set_window":
            X_test_seq, y_test_seq = self.create_sequences(X_test_preprocessed, y_test.values)
        else:
            raise ValueError(f"Invalid time_series_sequence_mode: {self.time_series_sequence_mode}")
        
        return X_seq, X_test_seq, y_seq, y_test_seq, recommendations, None




    def adapt_sequence_shape(self, X_seq: np.ndarray, target_shape: tuple) -> np.ndarray:
        """
        Adapt the sequence shape to match the target shape required by the model.
        
        This method handles:
        1. Sequence length adaptation (padding/truncation)
        2. Feature dimension adaptation (padding/truncation)
        
        Args:
            X_seq (np.ndarray): Input sequence array with shape (batch, seq_len, features)
            target_shape (tuple): Target shape expected by the model (None, seq_len, features)
            
        Returns:
            np.ndarray: Adapted sequence with the target shape
        """
        current_shape = X_seq.shape
        
        # Extract the relevant dimensions (ignoring batch size)
        current_seq_len = current_shape[1]
        current_features = current_shape[2]
        target_seq_len = target_shape[1]
        target_features = target_shape[2]
        
        # No need to adapt if shapes already match
        if current_seq_len == target_seq_len and current_features == target_features:
            return X_seq
        
        self.logger.info(f"Adapting sequence shape: {current_shape} -> {(current_shape[0], target_seq_len, target_features)}")
        
        # Adapt sequence length first
        if current_seq_len != target_seq_len:
            self.logger.warning(f"Sequence length mismatch: {current_seq_len} vs {target_seq_len}")
            
            if current_seq_len > target_seq_len:
                # Truncation approach (by default take the most recent data points)
                self.logger.info(f"Truncating sequence length from {current_seq_len} to {target_seq_len}")
                X_seq = X_seq[:, -target_seq_len:, :]
            else:
                # Padding approach (add zeros at the beginning by default)
                self.logger.info(f"Padding sequence length from {current_seq_len} to {target_seq_len}")
                pad_length = target_seq_len - current_seq_len
                padding = np.zeros((X_seq.shape[0], pad_length, X_seq.shape[2]))
                X_seq = np.concatenate([padding, X_seq], axis=1)
        
        # Then adapt feature dimension if needed
        if current_features != target_features:
            self.logger.warning(f"Feature dimension mismatch: {current_features} vs {target_features}")
            
            if current_features > target_features:
                # Truncate features (take the first target_features features)
                self.logger.info(f"Truncating feature dimension from {current_features} to {target_features}")
                X_seq = X_seq[:, :, :target_features]
            else:
                # Pad features with zeros
                self.logger.info(f"Padding feature dimension from {current_features} to {target_features}")
                pad_width = target_features - current_features
                padding = np.zeros((X_seq.shape[0], X_seq.shape[1], pad_width))
                X_seq = np.concatenate([X_seq, padding], axis=2)
        
        self.logger.info(f"Final sequence shape after adaptation: {X_seq.shape}")
        return X_seq



    def preprocess_predict_time_series(self, X: pd.DataFrame) -> np.ndarray:
        """
        Preprocess new time series data for prediction.
        
        Args:
            X (pd.DataFrame): New data for prediction
                
        Returns:
            np.ndarray: Preprocessed data ready for prediction, with shape adaptations if needed
        """
        self.logger.info("Preprocessing time series data for prediction")
        
        # Load transformers
        transformers = self.load_transformers()
        self.pipeline = transformers.get('preprocessor')
        if self.pipeline is None:
            raise ValueError("Failed to load preprocessing pipeline from transformers")
        
        # Handle missing values and outliers
        X_clean, _ = self.handle_missing_values(X)
        
        # Sort by time column if available
        if self.time_column and self.time_column in X_clean.columns:
            X_clean['__time__'] = pd.to_datetime(X_clean[self.time_column])
            X_clean = X_clean.sort_values(by='__time__').drop(columns=['__time__'])
        
        # Process based on sequence mode
        if self.time_series_sequence_mode == "set_window":
            # For set_window, create sequences from the most recent data
            X_no_target = X_clean.drop(columns=self.y_variable, errors='ignore')
            X_preprocessed = self.pipeline.transform(X_no_target)
            
            self.logger.debug(f"Preprocessed data shape before reshaping: {X_preprocessed.shape}")
            self.logger.debug(f"Using window_size: {self.window_size} for reshaping")
            
            total_required = self.window_size
            if len(X_preprocessed) < total_required:
                raise ValueError(f"Insufficient data for prediction: need at least {total_required} samples")
            
            # Take the most recent window
            X_seq = X_preprocessed[-total_required:].reshape(1, total_required, -1)
            self.logger.debug(f"Sequence shape after initial reshaping: {X_seq.shape}")
        
        elif self.time_series_sequence_mode in ["dtw", "pad"]:
            # For DTW/pad, use the process_dtw_or_pad method
            # First, create a dummy target variable required by process_dtw_or_pad
            dummy_target = pd.DataFrame({self.y_variable[0]: np.zeros(len(X_clean))})
            X_with_dummy = pd.concat([X_clean, dummy_target], axis=1)
            
            # Process the data using the dtw/pad logic
            X_seq, _ = self.process_dtw_or_pad(X_with_dummy)
            self.logger.debug(f"Sequence shape after dtw/pad processing: {X_seq.shape}")
        
        elif self.time_series_sequence_mode == "variable_length":
            # For variable length, preprocess and use all available data
            X_no_target = X_clean.drop(columns=self.y_variable, errors='ignore')
            X_preprocessed = self.pipeline.transform(X_no_target)
            X_seq = X_preprocessed.reshape(1, len(X_preprocessed), -1)
            self.logger.debug(f"Sequence shape after variable_length processing: {X_seq.shape}")
        
        else:
            raise ValueError(f"Invalid time_series_sequence_mode: {self.time_series_sequence_mode}")
        
        # NEW: Check if we need to adapt the sequence shape to match the model's expectations
        if hasattr(self, 'expected_model_shape') and self.expected_model_shape is not None:
            X_seq = self.adapt_sequence_shape(X_seq, self.expected_model_shape)
        
        return X_seq



    def visualize_psi_results(self, data: pd.DataFrame, top_n: int = 10):
        """
        Visualize PSI results for the top N features with highest PSI values.
        
        Args:
            data (pd.DataFrame): Original data
            top_n (int): Number of top features to visualize
        """
        if not hasattr(self, 'psi_values') or not self.psi_values:
            self.logger.warning("No PSI values available. Run apply_psi_feature_selection first.")
            return
        
        import matplotlib.pyplot as plt
        import seaborn as sns
        
        # Sort features by PSI value
        sorted_psi = sorted(self.psi_values.items(), key=lambda x: x[1], reverse=True)
        top_features = sorted_psi[:top_n]
        
        # Plot PSI values
        plt.figure(figsize=(12, 6))
        sns.barplot(x=[f[0] for f in top_features], y=[f[1] for f in top_features])
        plt.title(f'Top {top_n} Features by PSI Value')
        plt.xlabel('Feature')
        plt.ylabel('PSI Value')
        plt.xticks(rotation=45, ha='right')
        plt.axhline(y=0.1, color='green', linestyle='--', label='Minor Change (0.1)')
        plt.axhline(y=0.2, color='red', linestyle='--', label='Significant Change (0.2)')
        plt.legend()
        plt.tight_layout()
        plt.savefig(os.path.join(self.graphs_output_dir, 'psi_values.png'))
        plt.close()
        
        # Get reference and test sets
        split_frac = self.options.get('psi_feature_selection', {}).get('split_frac', 0.75)
        split_idx = int(split_frac * len(data))
        reference = data.iloc[:split_idx]
        test = data.iloc[split_idx:]
        
        # Plot distribution comparisons for top features
        for feature, psi_value in top_features:
            if feature in data.columns:
                plt.figure(figsize=(10, 6))
                plt.subplot(2, 1, 1)
                sns.histplot(reference[feature].dropna(), color='blue', label='Reference', kde=True)
                sns.histplot(test[feature].dropna(), color='red', label='Test', kde=True)
                plt.title(f'Distribution Comparison for {feature} (PSI={psi_value:.4f})')
                plt.legend()
                
                plt.subplot(2, 1, 2)
                sns.boxplot(data=[reference[feature].dropna(), test[feature].dropna()], 
                        width=0.5)
                plt.xticks([0, 1], ['Reference', 'Test'])
                plt.tight_layout()
                plt.savefig(os.path.join(self.graphs_output_dir, f'psi_dist_{feature}.png'))
                plt.close()



    def preprocess_train(self, X: pd.DataFrame, y: pd.Series) -> Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series, pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Preprocess training data for various model types.
        For time series models, delegate to preprocess_time_series.
        
        Returns:
            - For standard models: X_train_final, X_test_final, y_train_smoted, y_test, recommendations, X_test_inverse.
            - For time series models: X_seq, None, y_seq, None, recommendations, None.
        """
        # If the model is time series, use the dedicated time series preprocessing flow.
        if self.model_category == 'time_series':
            return self.preprocess_time_series(X, y)
        
        # Get split options from configuration
        split_options = self.options.get('split_dataset', {})
        split_ratio = split_options.get('test_size', 0.2)
        time_split_column = split_options.get('time_split_column', None)
        time_split_value = split_options.get('time_split_value', None)
        
        # Standard preprocessing flow for classification/regression/clustering
        X_train_original, X_test_original, y_train_original, y_test = self.split_dataset(
            X, y, 
            split_ratio=split_ratio,
            time_split_column=time_split_column,
            time_split_value=time_split_value
        )
        
        X_train_missing_values, X_test_missing_values = self.handle_missing_values(X_train_original, X_test_original)
        
        # Only perform normality tests if applicable
        if self.model_category in ['regression', 'classification', 'clustering']:
            self.test_normality(X_train_missing_values)
        
        X_train_outliers_handled, y_train_outliers_handled = self.handle_outliers(X_train_missing_values, y_train_original)
        X_test_outliers_handled = X_test_missing_values.copy() if X_test_missing_values is not None else None
        recommendations = self.generate_recommendations()
        self.pipeline = self.build_pipeline(X_train_outliers_handled)
        X_train_preprocessed = self.pipeline.fit_transform(X_train_outliers_handled)
        X_test_preprocessed = self.pipeline.transform(X_test_outliers_handled) if X_test_outliers_handled is not None else None

        if self.model_category == 'classification':
            try:
                X_train_smoted, y_train_smoted = self.implement_smote(X_train_preprocessed, y_train_outliers_handled)
            except Exception as e:
                self.logger.error(f"❌ SMOTE application failed: {e}")
                raise
        else:
            X_train_smoted, y_train_smoted = X_train_preprocessed, y_train_outliers_handled
            self.logger.info("⚠️ SMOTE not applied: Not a classification model.")

        self.final_feature_order = list(self.pipeline.get_feature_names_out())
        X_train_final = pd.DataFrame(X_train_smoted, columns=self.final_feature_order)
        X_test_final = pd.DataFrame(X_test_preprocessed, columns=self.final_feature_order, index=X_test_original.index) if X_test_preprocessed is not None else None

        try:
            self.save_transformers()
        except Exception as e:
            self.logger.error(f"❌ Saving transformers failed: {e}")
            raise

        try:
            if X_test_final is not None:
                X_test_inverse = self.inverse_transform_data(X_test_final.values, original_data=X_test_original)
                self.logger.info("✅ Inverse transformations applied successfully.")
            else:
                X_test_inverse = None
        except Exception as e:
            self.logger.error(f"❌ Inverse transformations failed: {e}")
            X_test_inverse = None

        return X_train_final, X_test_final, y_train_smoted, y_test, recommendations, X_test_inverse



    def preprocess_predict(self, X: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Preprocess new data for prediction.

        Args:
            X (pd.DataFrame): New data for prediction.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame, Optional[pd.DataFrame]]: X_preprocessed, recommendations, X_inversed
        """
        step_name = "Preprocess Predict"
        self.logger.info(f"Step: {step_name}")

        # Log initial columns and feature count
        self.logger.debug(f"Initial columns in prediction data: {X.columns.tolist()}")
        self.logger.debug(f"Initial number of features: {X.shape[1]}")

        # Load transformers
        try:
            transformers = self.load_transformers()
            self.logger.debug("Transformers loaded successfully.")
        except Exception as e:
            self.logger.error(f"❌ Failed to load transformers: {e}")
            raise

        # Filter columns based on raw feature names
        try:
            X_filtered = self.filter_columns(X)
            self.logger.debug(f"Columns after filtering: {X_filtered.columns.tolist()}")
            self.logger.debug(f"Number of features after filtering: {X_filtered.shape[1]}")
        except Exception as e:
            self.logger.error(f"❌ Failed during column filtering: {e}")
            raise

        # Handle missing values
        try:
            X_filtered, _ = self.handle_missing_values(X_filtered)
            self.logger.debug(f"Columns after handling missing values: {X_filtered.columns.tolist()}")
            self.logger.debug(f"Number of features after handling missing values: {X_filtered.shape[1]}")
        except Exception as e:
            self.logger.error(f"❌ Failed during missing value handling: {e}")
            raise

        # Ensure all expected raw features are present
        expected_raw_features = self.numericals + self.ordinal_categoricals + self.nominal_categoricals
        provided_features = X_filtered.columns.tolist()

        self.logger.debug(f"Expected raw features: {expected_raw_features}")
        self.logger.debug(f"Provided features: {provided_features}")

        missing_raw_features = set(expected_raw_features) - set(provided_features)
        if missing_raw_features:
            self.logger.error(f"❌ Missing required raw feature columns in prediction data: {missing_raw_features}")
            raise ValueError(f"Missing required raw feature columns in prediction data: {missing_raw_features}")

        # Handle unexpected columns (optional: ignore or log)
        unexpected_features = set(provided_features) - set(expected_raw_features)
        if unexpected_features:
            self.logger.warning(f"⚠️ Unexpected columns in prediction data that will be ignored: {unexpected_features}")

        # Ensure the order of columns matches the pipeline's expectation (optional)
        X_filtered = X_filtered[expected_raw_features]
        self.logger.debug("Reordered columns to match the pipeline's raw feature expectations.")

        # Transform data using the loaded pipeline
        try:
            X_preprocessed_np = self.pipeline.transform(X_filtered)
            self.logger.debug(f"Transformed data shape: {X_preprocessed_np.shape}")
        except Exception as e:
            self.logger.error(f"❌ Transformation failed: {e}")
            raise

        # Retrieve feature names from the pipeline or use stored final_feature_order
        if hasattr(self.pipeline, 'get_feature_names_out'):
            try:
                columns = self.pipeline.get_feature_names_out()
                self.logger.debug(f"Derived feature names from pipeline: {columns.tolist()}")
            except Exception as e:
                self.logger.warning(f"Could not retrieve feature names from pipeline: {e}")
                columns = self.final_feature_order
                self.logger.debug(f"Using stored final_feature_order for column names: {columns}")
        else:
            columns = self.final_feature_order
            self.logger.debug(f"Using stored final_feature_order for column names: {columns}")

        # Convert NumPy array back to DataFrame with correct column names
        try:
            X_preprocessed_df = pd.DataFrame(X_preprocessed_np, columns=columns, index=X_filtered.index)
            self.logger.debug(f"X_preprocessed_df columns: {X_preprocessed_df.columns.tolist()}")
            self.logger.debug(f"Sample of X_preprocessed_df:\n{X_preprocessed_df.head()}")
        except Exception as e:
            self.logger.error(f"❌ Failed to convert transformed data to DataFrame: {e}")
            raise

        # Inverse transform for interpretability (optional, for interpretability)
        try:
            self.logger.debug(f"[DEBUG] Original data shape before inverse transform: {X.shape}")
            X_inversed = self.inverse_transform_data(X_preprocessed_np, original_data=X)
            self.logger.debug(f"[DEBUG] Inversed data shape: {X_inversed.shape}")
        except Exception as e:
            self.logger.error(f"❌ Inverse transformation failed: {e}")
            X_inversed = None

        # Generate recommendations (if applicable)
        try:
            recommendations = self.generate_recommendations()
            self.logger.debug("Generated preprocessing recommendations.")
        except Exception as e:
            self.logger.error(f"❌ Failed to generate recommendations: {e}")
            recommendations = pd.DataFrame()

        # Prepare outputs
        return X_preprocessed_df, recommendations, X_inversed

    def preprocess_clustering(self, X: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """
        Preprocess data for clustering mode.

        Args:
            X (pd.DataFrame): Input features for clustering.

        Returns:
            Tuple[pd.DataFrame, pd.DataFrame]: X_processed, recommendations.
        """
        step_name = "Preprocess Clustering"
        self.logger.info(f"Step: {step_name}")
        debug_flag = self.get_debug_flag('debug_handle_missing_values')  # Use relevant debug flags

        # Handle Missing Values
        X_missing, _ = self.handle_missing_values(X, None)
        self.logger.debug(f"After handling missing values: X_missing.shape={X_missing.shape}")

        # Handle Outliers
        X_outliers_handled, _ = self.handle_outliers(X_missing, None)
        self.logger.debug(f"After handling outliers: X_outliers_handled.shape={X_outliers_handled.shape}")

        # Test Normality (optional for clustering)
        if self.model_category in ['clustering']:
            self.logger.info("Skipping normality tests for clustering.")
        else:
            self.test_normality(X_outliers_handled)

        # Generate Preprocessing Recommendations
        recommendations = self.generate_recommendations()

        # Build and Fit the Pipeline
        self.pipeline = self.build_pipeline(X_outliers_handled)
        self.logger.debug("Pipeline built and fitted.")

        # Transform the data
        X_processed = self.pipeline.transform(X_outliers_handled)
        self.logger.debug(f"After pipeline transform: X_processed.shape={X_processed.shape}")

        # Optionally, inverse transformations can be handled if necessary

        # Save Transformers (if needed)
        # Not strictly necessary for clustering unless you plan to apply the same preprocessing on new data
        self.save_transformers()

        self.logger.info("✅ Clustering data preprocessed successfully.")

        return X_processed, recommendations

    def final_preprocessing(self, data: pd.DataFrame, model_input_shape=None) -> Tuple:
        """
        Execute the full preprocessing pipeline based on the mode.
        
        Args:
            data (pd.DataFrame): The input data to preprocess.
            model_input_shape (tuple, optional): Expected model input shape for reshape operations.
                                            Useful during prediction to ensure compatibility.
        
        Returns:
            Tuple: Depending on mode:
                - 'train': For standard models: X_train, X_test, y_train, y_test, recommendations, X_test_inverse.
                        For time series models: X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, None.
                - 'predict': X_preprocessed, recommendations, X_inverse.
                - 'clustering': X_processed, recommendations.
        """
        self.logger.info(f"Starting: Final Preprocessing Pipeline in '{self.mode}' mode.")
        
        # Store expected model shape if provided
        if model_input_shape is not None:
            self.expected_model_shape = model_input_shape
            self.logger.info(f"Using provided model input shape: {model_input_shape}")
        
        # First, filter the data columns based on configuration.
        try:
            data = self.filter_columns(data)
            self.logger.info("✅ Column filtering completed successfully.")
        except Exception as e:
            self.logger.error(f"❌ Column filtering failed: {e}")
            raise
                
        if self.mode == 'train':
            if self.model_category == 'time_series':
                # For time series training, use the dedicated time series preprocessing flow.
                X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, extras = self.preprocess_time_series(data)
                
                # Store the actual output shape for future reference
                if X_train_seq is not None and hasattr(X_train_seq, 'shape') and len(X_train_seq.shape) >= 3:
                    self.actual_output_shape = (None,) + X_train_seq.shape[1:]
                    self.logger.debug(f"Storing actual output shape: {self.actual_output_shape}")
                    
                    # If expected shape not provided, use the actual output shape
                    if not hasattr(self, 'expected_model_shape') or self.expected_model_shape is None:
                        self.expected_model_shape = self.actual_output_shape
                        self.logger.debug(f"Setting expected model shape to actual output shape: {self.expected_model_shape}")
                    
                    # Save shape information with transformers
                    try:
                        self.save_transformers(model_input_shape=self.expected_model_shape)
                    except Exception as e:
                        self.logger.warning(f"Could not save model shape information: {e}")
                
                return X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, extras
            else:
                # Standard processing for non-time series models.
                if not all(col in data.columns for col in self.y_variable):
                    missing_y = [col for col in self.y_variable if col not in data.columns]
                    raise ValueError(f"Target variable(s) {missing_y} not found in the dataset.")
                X = data.drop(self.y_variable, axis=1)
                y = data[self.y_variable].iloc[:, 0] if len(self.y_variable) == 1 else data[self.y_variable]
                return self.preprocess_train(X, y)
        
        elif self.mode == 'predict':
            X = data.copy()
            # If the model is time series, use the dedicated prediction function.
            if self.model_category == 'time_series':
                X_preprocessed = self.preprocess_predict_time_series(X)
                
                # Apply shape adaptation if needed
                if hasattr(self, 'expected_model_shape') and self.expected_model_shape is not None:
                    X_preprocessed = self.adapt_sequence_shape(X_preprocessed, self.expected_model_shape)
                    
                self.logger.info("✅ Preprocessing completed successfully in time series predict mode.")
                # For time series prediction, recommendations and inverse transformation may not be applicable.
                return X_preprocessed, None, None
            else:
                transformers_path = os.path.join(self.transformers_dir, 'transformers.pkl')
                if not os.path.exists(transformers_path):
                    self.logger.error(f"❌ Transformers file not found at '{self.transformers_dir}'. Cannot proceed with prediction.")
                    raise FileNotFoundError(f"Transformers file not found at '{self.transformers_dir}'.")
                X_preprocessed, recommendations, X_inversed = self.preprocess_predict(X)
                self.logger.info("✅ Preprocessing completed successfully in predict mode.")
                return X_preprocessed, recommendations, X_inversed
        
        elif self.mode == 'clustering':
            X = data.copy()
            return self.preprocess_clustering(X)
        
        else:
            raise NotImplementedError(f"Mode '{self.mode}' is not implemented.")






if __name__ == "__main__":
    # Notes:
    # Comprehensive test of the DataPreprocessor for time series data with LSTM models.
    # Tests both train and predict modes with different splitting strategies.
    import pandas as pd
    import numpy as np
    from datetime import datetime
    import matplotlib.pyplot as plt
    import os
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    from tensorflow.keras.models import Sequential, load_model
    from tensorflow.keras.layers import LSTM, Dense, Dropout
    import shutil
    import yaml
    
    # Load configuration from YAML file
    config_file = "../../dataset/test/preprocessor_config/preprocessor_config_baseball.yaml"
    with open(config_file, 'r') as f:
        config = yaml.safe_load(f)

    # 1. Load your training data using the configured data_dir and raw_data path.
    data_path = os.path.join(config["paths"]["data_dir"], config["paths"]["raw_data"])
    data = pd.read_parquet(data_path)


    # -------------------------------------------------------------------------
    # Custom dataset adjustments
    # Filter out time_step column
    data = data.drop('time_step', axis=1, errors='ignore')

    # check on the datetime columns unique values
    print("unique datetime metrics ================", data['biomech_datetime'].unique())

    # Display columns
    print("\nDataset columns:")
    for col in data.columns:
        print(f"- {col}")

    # Check for null sums in the filtered data
    null_sums = data.isnull().sum()
    if null_sums.any():
        print("[WARNING] Found null values in the following columns:")
        print(null_sums[null_sums > 0])
    else:
        print("[INFO] Dataset contains no null values and is ready for machine learning.")

    # Filter out "Follow Through" phase
    data = data[data['pitch_phase_biomech'] != 'Follow Through']
    print(f"[INFO] Training data loaded from {data_path}. Shape: {data.shape}")
    print(f"[INFO] Filtered out 'Follow Through' phase. New shape: {data.shape}")

    print("Available config keys:", config.keys())
    options = config.get("time_series", {})
    if not options:
        raise KeyError("The configuration is missing the 'time_series' key. Please verify the YAML configuration.")

    print(data.head())
    
    # Define model building function
    def build_lstm_model(input_shape, horizon=1):
        """
        Build an LSTM model with an output layer that matches the specified horizon.
        
        Args:
            input_shape: Tuple defining the input shape (timesteps, features)
            horizon: Number of future timesteps to predict (output dimension)
            
        Returns:
            A compiled Keras Sequential model
        """
        model = Sequential([
            LSTM(64, input_shape=input_shape, return_sequences=True),
            Dropout(0.2),
            LSTM(32),
            Dropout(0.2),
            Dense(horizon)  # Output dimension now dynamically set by horizon
        ])
        model.compile(optimizer='adam', loss='mse', metrics=['mae'])
        return model
    
    def get_horizon_from_preprocessor(preprocessor):
        """
        Extract the horizon parameter from the preprocessor options.
        
        Args:
            preprocessor: DataPreprocessor instance
            
        Returns:
            horizon: Integer value of the horizon parameter (default: 1)
        """
        if hasattr(preprocessor, 'options') and isinstance(preprocessor.options, dict):
            return preprocessor.options.get('horizon', 1)
        elif hasattr(preprocessor, 'horizon'):
            return preprocessor.horizon
        else:
            return 1  # Default horizon if not specified

    
    # # ---------- Test 1: Percentage-based Split ----------
    # print("\n\n=== Test 1: Percentage-based Split (80/20) ===")

    # # Clean transformers directory
    # shutil.rmtree('./transformers', ignore_errors=True)
    # os.makedirs('./transformers', exist_ok=True)

    # # Configure the preprocessor for training without explicit window_size, step_size, or horizon parameters.
    # preprocessor = DataPreprocessor(
    #     model_type="LSTM",
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"]["ordinal_categoricals"],
    #     nominal_categoricals=config["features"]["nominal_categoricals"],
    #     numericals=config["features"]["numericals"],
    #     mode="train",
    #     options={
    #         "enabled": True,
    #         "time_column": "biomech_datetime",
    #         "horizon": 10,  # Horizon value is provided here
    #         "step_size": 1,  # Step size provided here
    #         "sequence_modes": {
    #             "set_window": {
    #                 "window_size": 10,  # Window size provided here
    #                 "max_sequence_length": 10
    #             }
    #         },
    #         "ts_sequence_mode": "set_window",
    #         "split_dataset": {
    #             "test_size": 0.2,
    #             "random_state": 42
    #         },
    #         "time_series_split": {
    #             "method": "standard"
    #         }
    #     },
    #     sequence_categorical=["session_biomech", "trial_biomech"],
    #     sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
    #     time_series_sequence_mode="set_window",
    #     debug=True
    # )

    # # Preprocess the training data
    # X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, _ = preprocessor.final_preprocessing(data)


    # print(f"Train shapes - X: {X_train_seq.shape}, y: {y_train_seq.shape}")
    # print(f"Test shapes - X: {X_test_seq.shape}, y: {y_test_seq.shape}")
    
    # # Train a model
    # print("Training LSTM model with percentage-based split...")
    # # Extract horizon from preprocessor
    # horizon = get_horizon_from_preprocessor(preprocessor)
    # print(f"Using horizon of {horizon} for model output dimension")

    # # Build the LSTM model using the extracted horizon
    # model1 = build_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]), horizon=horizon)

    # model1.fit(
    #     X_train_seq, y_train_seq, 
    #     validation_data=(X_test_seq, y_test_seq),
    #     epochs=10, batch_size=32, verbose=1
    # )
    # model1.save('./transformers/model_percentage_split.h5')
    
    # # Predict using the test set
    # predictions = model1.predict(X_test_seq)
    # print(f"Predictions shape: {predictions.shape}, Target shape: {y_test_seq.shape}")
    # if predictions.shape[-1] != y_test_seq.shape[-1]:
    #     print(f"WARNING: Shape mismatch detected: predictions {predictions.shape} vs targets {y_test_seq.shape}")
        
    # # Apply dimension compatibility function
    # y_test_seq, predictions = ensure_compatible_dimensions(y_test_seq, predictions)

    # mae = mean_absolute_error(y_test_seq, predictions)
    # rmse = np.sqrt(mean_squared_error(y_test_seq, predictions))
    # print(f"Model evaluation - MAE: {mae:.4f}, RMSE: {rmse:.4f}")

    
    # # Predict using predict mode
    # print("\nTesting prediction mode with new data...")
    
    # # Take the last segment of data as "new" data for prediction
    # new_data = data.iloc[-48:].copy()  
    
    # # Configure the preprocessor for prediction
    # predict_preprocessor = DataPreprocessor(
    #     model_type="LSTM",
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"]["ordinal_categoricals"],
    #     nominal_categoricals=config["features"]["nominal_categoricals"],
    #     numericals=config["features"]["numericals"],
    #     mode="predict",
    #     options={
    #         "enabled": True,
    #         "time_column": "biomech_datetime",
    #         "horizon": 10,              # Number of time steps ahead to predict
    #         "step_size": 1,             # Step size for moving the window
    #         "sequence_modes": {         # Window configuration for sequence mode
    #             "set_window": {
    #                 "window_size": 10,         # Size of each window
    #                 "max_sequence_length": 10  # Maximum sequence length
    #             }
    #         },
    #         "ts_sequence_mode": "set_window"
    #     },
    #     sequence_categorical=["session_biomech", "trial_biomech"],
    #     sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
    #     transformers_dir="./transformers"
    # )

    

    # # Make predictions
    # model1 = load_model('./transformers/model_percentage_split.h5')
    # expected_shape = model1.input_shape
    # print(f"Expected model input shape: {expected_shape}")

    # # Preprocess new data for prediction
    # X_new_preprocessed, _, _ = predict_preprocessor.final_preprocessing(new_data, model_input_shape=expected_shape)
    
    # print(f"Prediction data shape: {X_new_preprocessed.shape}")
    
    # predictions = model1.predict(X_new_preprocessed)
    # print(f"Prediction results shape: {predictions.shape}")
    # print(f"Predictions: {predictions[:5].flatten()}")
    
    # # ---------- Test 2: Date-based Split ----------
    # print("\n\n=== Test 2: Date-based Split (2025-02-14 11:00) ===")
    
    # # Clean transformers directory
    # shutil.rmtree('./transformers', ignore_errors=True)
    # os.makedirs('./transformers', exist_ok=True)
    
    # # Configure the preprocessor for training with date-based split
    # preprocessor = DataPreprocessor(
    #     model_type="LSTM",
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"]["ordinal_categoricals"],
    #     nominal_categoricals=config["features"]["nominal_categoricals"],
    #     numericals=config["features"]["numericals"],
    #     mode="train",
    #     options={
    #         "enabled": True,
    #         "time_column": "biomech_datetime",
    #         "horizon": 10,
    #         "step_size": 1,
    #         "sequence_modes": {
    #             "set_window": {
    #                 "window_size": 10,  # 1 day window
    #                 "max_sequence_length": 10
    #             }
    #         },
    #         "ts_sequence_mode": "set_window",
    #         "split_dataset": {
    #             "test_size": 0.2,  # Not used for date-based split
    #             "random_state": 42,
    #             "time_split_column": "biomech_datetime",
    #             "time_split_value": pd.Timestamp("2025-02-14 11:50:00")
    #         },
    #         "time_series_split": {
    #             "method": "standard"
    #         }
    #     },
    #     sequence_categorical=["session_biomech", "trial_biomech"],
    #     sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
    #     time_series_sequence_mode="set_window",
    #     debug=True
    # )
    
    # # Preprocess the training data
    # X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, _ = preprocessor.final_preprocessing(data)
    
    # print(f"Train shapes - X: {X_train_seq.shape}, y: {y_train_seq.shape}")
    # print(f"Test shapes - X: {X_test_seq.shape}, y: {y_test_seq.shape}")
    
    # # Train a model
    # print("Training LSTM model with date-based split...")
    # # Extract horizon from preprocessor
    # horizon = get_horizon_from_preprocessor(preprocessor)
    # print(f"Using horizon of {horizon} for model output dimension")

    # # Build the LSTM model using the extracted horizon
    # model2 = build_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]), horizon=horizon)

    # model2.fit(
    #     X_train_seq, y_train_seq, 
    #     validation_data=(X_test_seq, y_test_seq),
    #     epochs=10, batch_size=32, verbose=1
    # )
    # model2.save('./transformers/model_date_split.h5')
    
    # # Test prediction mode
    # print("\nTesting prediction mode with new data...")
    
    # # Take the last segment of data as "new" data for prediction
    # new_data = data.iloc[-48:].copy()  
    
    # # Configure the preprocessor for prediction
    # predict_preprocessor = DataPreprocessor(
    #     model_type="LSTM",
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"]["ordinal_categoricals"],
    #     nominal_categoricals=config["features"]["nominal_categoricals"],
    #     numericals=config["features"]["numericals"],
    #     mode="predict",
    #     options={
    #         "enabled": True,
    #         "time_column": "biomech_datetime",
    #         "horizon": 10,              # Number of time steps ahead to predict
    #         "step_size": 1,             # Step size for moving the window
    #         "sequence_modes": {         # Window configuration for sequence mode
    #             "set_window": {
    #                 "window_size": 10,         # Size of each window
    #                 "max_sequence_length": 10  # Maximum sequence length
    #             }
    #         },
    #         "ts_sequence_mode": "set_window"
    #     },
    #     sequence_categorical=["session_biomech", "trial_biomech"],
    #     sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
    #     transformers_dir="./transformers"
    # )

    
    # # Preprocess new data for prediction
    # X_new_preprocessed, _, _ = predict_preprocessor.final_preprocessing(new_data)
    
    # # Make predictions
    # model2 = load_model('./transformers/model_date_split.h5')
    # predictions = model2.predict(X_new_preprocessed)
    # print(f"Prediction results shape: {predictions.shape}")
    
    # # ---------- Test 3: PSI-based Split with Feature-Engine ----------
    # print("\n\n=== Test 3: PSI-based Split with Feature-Engine ===")
    
    # # Clean transformers directory
    # shutil.rmtree('./transformers', ignore_errors=True)
    # os.makedirs('./transformers', exist_ok=True)
    
    # # Configure the preprocessor for training with PSI-based split
    # preprocessor = DataPreprocessor(
    #     model_type="LSTM",
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"]["ordinal_categoricals"],
    #     nominal_categoricals=config["features"]["nominal_categoricals"],
    #     numericals=config["features"]["numericals"],
    #     mode="train",
    #     options={
    #         "enabled": True,
    #         "time_column": "biomech_datetime",
    #         "horizon": 10,
    #         "step_size": 1,
    #         "sequence_modes": {
    #             "set_window": {
    #                 "window_size": 10,  # 1 day window
    #                 "max_sequence_length": 10
    #             }
    #         },
    #         "ts_sequence_mode": "set_window",
    #         "psi_feature_selection": {
    #             "enabled": True,
    #             "threshold": 0.25,
    #             "split_frac": 0.75,
    #             "split_distinct": False,
    #             "apply_before_split": True
    #         },
    #         "feature_engine_split": {
    #             "enabled": True,
    #             "split_frac": 0.75,
    #             "split_distinct": False
    #         },
    #         "time_series_split": {
    #             "method": "feature_engine"
    #         }
    #     },
    #     sequence_categorical=["session_biomech", "trial_biomech"],
    #     sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
    #     time_series_sequence_mode="set_window",
    #     debug=True,
    #     graphs_output_dir="./plots"
    # )
    
    # # Preprocess the training data
    # X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, _ = preprocessor.final_preprocessing(data)
    
    # print(f"Train shapes - X: {X_train_seq.shape}, y: {y_train_seq.shape}")
    # print(f"Test shapes - X: {X_test_seq.shape}, y: {y_test_seq.shape}")
    
    # # Visualize PSI results if the method was run
    # preprocessor.visualize_psi_results(data, top_n=5)
    
    # # Train a model
    # print("Training LSTM model with PSI-based split...")
    # # Extract horizon from preprocessor
    # horizon = get_horizon_from_preprocessor(preprocessor)
    # print(f"Using horizon of {horizon} for model output dimension")

    # # Build the LSTM model using the extracted horizon
    # model3 = build_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]), horizon=horizon)

    # model3.fit(
    #     X_train_seq, y_train_seq, 
    #     validation_data=(X_test_seq, y_test_seq),
    #     epochs=10, batch_size=32, verbose=1
    # )
    # model3.save('./transformers/model_psi_split.h5')
    
    # # Test prediction mode
    # print("\nTesting prediction mode with new data...")
    
    # # Take the last segment of data as "new" data for prediction
    # new_data = data.iloc[-48:].copy()  
    
    # # Configure the preprocessor for prediction
    # predict_preprocessor = DataPreprocessor(
    #     model_type="LSTM",
    #     y_variable=config["features"]["y_variable"],
    #     ordinal_categoricals=config["features"]["ordinal_categoricals"],
    #     nominal_categoricals=config["features"]["nominal_categoricals"],
    #     numericals=config["features"]["numericals"],
    #     mode="predict",
    #     options={
    #         "enabled": True,
    #         "time_column": "biomech_datetime",
    #         "horizon": 10,              # Number of time steps ahead to predict
    #         "step_size": 1,             # Step size for moving the window
    #         "sequence_modes": {         # Window configuration for sequence mode
    #             "set_window": {
    #                 "window_size": 10,         # Size of each window
    #                 "max_sequence_length": 10  # Maximum sequence length
    #             }
    #         },
    #         "ts_sequence_mode": "set_window"
    #     },
    #     sequence_categorical=["session_biomech", "trial_biomech"],
    #     sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
    #     transformers_dir="./transformers"
    # )

    
    # # Preprocess new data for prediction
    # X_new_preprocessed, _, _ = predict_preprocessor.final_preprocessing(new_data)
    
    # # Make predictions
    # model3 = load_model('./transformers/model_psi_split.h5')
    # predictions = model3.predict(X_new_preprocessed)
    # print(f"Prediction results shape: {predictions.shape}")

    # ---------- Test 4: DTW/Pad Mode with PSI-based Split ----------
    print("\n\n=== Test 4: DTW/Pad Mode with PSI-based Split ===")
    
    # Clean transformers directory
    shutil.rmtree('./transformers', ignore_errors=True)
    os.makedirs('./transformers', exist_ok=True)
    
    # Configure the preprocessor for training with DTW mode
    preprocessor = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="train",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "horizon": 10,
            "step_size": 1,
            "sequence_modes": {
                "pad": {
                    "padding_side": "post"
                }
            },
            "ts_sequence_mode": "pad",
            "psi_feature_selection": {
                "enabled": True,
                "threshold": 0.25,
                "split_frac": 0.75,
                "split_distinct": False,
                "apply_before_split": True
            },
            "feature_engine_split": {
                "enabled": True,
                "split_frac": 0.75,
                "split_distinct": False
            },
            "time_series_split": {
                "method": "feature_engine"
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="pad",
        debug=True,
        graphs_output_dir="./plots"
    )
    
    # Preprocess the training data
    X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, _ = preprocessor.final_preprocessing(data)
    
    print(f"Train shapes - X: {X_train_seq.shape}, y: {y_train_seq.shape}")
    print(f"Test shapes - X: {X_test_seq.shape}, y: {y_test_seq.shape}")
    
    # Train a model
    print("Training LSTM model with DTW/Pad mode...")
    # Extract horizon from preprocessor
    horizon = get_horizon_from_preprocessor(preprocessor)
    print(f"Using horizon of {horizon} for model output dimension")

    # Build the LSTM model using the extracted horizon
    model4 = build_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]), horizon=horizon)

    model4.fit(
        X_train_seq, y_train_seq, 
        validation_data=(X_test_seq, y_test_seq),
        epochs=10, batch_size=32, verbose=1
    )
    model4.save('./transformers/model_dtw_pad.h5')
    
    # Test prediction mode
    print("\nTesting prediction mode with new data...")
    
    # Take the last segment of data as "new" data for prediction
    new_data = data.iloc[-48:].copy()  
    
    # Configure the preprocessor for prediction
    predict_preprocessor = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="predict",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "ts_sequence_mode": "pad"
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="pad",
        transformers_dir="./transformers"
    )
    
    # Preprocess new data for prediction
    X_new_preprocessed, _, _ = predict_preprocessor.final_preprocessing(new_data)
    
    # Make predictions
    model4 = load_model('./transformers/model_dtw_pad.h5')
    predictions = model4.predict(X_new_preprocessed)
    print(f"Prediction results shape: {predictions.shape}")
    
    print("\n\nAll tests completed successfully!")



    # Clean transformers directory
    shutil.rmtree('./transformers', ignore_errors=True)
    os.makedirs('./transformers', exist_ok=True)
    
    # Configure the preprocessor for training with DTW mode
    preprocessor = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="train",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "horizon": 10,
            "step_size": 1,
            "sequence_modes": {
                "dtw": {
                    "reference_sequence": "max",  # Use mean sequence as reference
                    "dtw_threshold": 0.2          # DTW threshold for sequences
                }
            },
            "ts_sequence_mode": "dtw",
            "psi_feature_selection": {
                "enabled": True,
                "threshold": 0.25,
                "split_frac": 0.75,
                "split_distinct": False,
                "apply_before_split": True
            },
            "feature_engine_split": {
                "enabled": True,
                "split_frac": 0.75,
                "split_distinct": False
            },
            "time_series_split": {
                "method": "feature_engine"
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="dtw",
        debug=True,
        graphs_output_dir="./plots"
    )
    
    # Preprocess the training data
    X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, _ = preprocessor.final_preprocessing(data)
    
    print(f"Train shapes - X: {X_train_seq.shape}, y: {y_train_seq.shape}")
    print(f"Test shapes - X: {X_test_seq.shape}, y: {y_test_seq.shape}")
    
    # Train a model
    print("Training LSTM model with DTW/Pad mode...")
    # Extract horizon from preprocessor
    horizon = get_horizon_from_preprocessor(preprocessor)
    print(f"Using horizon of {horizon} for model output dimension")

    # Build the LSTM model using the extracted horizon
    model5 = build_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]), horizon=horizon)

    model5.fit(
        X_train_seq, y_train_seq, 
        validation_data=(X_test_seq, y_test_seq),
        epochs=10, batch_size=32, verbose=1
    )
    model5.save('./transformers/model_dtw.h5')
    
    # Test prediction mode
    print("\nTesting prediction mode with new data...")
    
    # Take the last segment of data as "new" data for prediction
    new_data = data.iloc[-48:].copy()  
    
    # Configure the preprocessor for prediction
    predict_preprocessor = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="predict",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "ts_sequence_mode": "dtw"
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="dtw",
        transformers_dir="./transformers"
    )
    
    # Preprocess new data for prediction
    X_new_preprocessed, _, _ = predict_preprocessor.final_preprocessing(new_data)
    
    # Make predictions
    model5 = load_model('./transformers/model_dtw.h5')
    predictions = model4.predict(X_new_preprocessed)
    print(f"Prediction results shape: {predictions.shape}")
    
    print("\n\nAll tests completed successfully!")

    # Test 5: Pad Mode with Percentage-Based Sequence-Aware Split
    print("\n\n=== Test 5: Pad Mode with Percentage-Based Sequence-Aware Split ===")

    # Clean transformers directory
    shutil.rmtree('./transformers', ignore_errors=True)
    os.makedirs('./transformers', exist_ok=True)

    # Configure preprocessor for training with pad mode and percentage-based split
    pad_pct_preprocessor = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="train",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "horizon": 10,
            "step_size": 1,
            "sequence_modes": {
                "pad": {
                    "padding_side": "post"
                }
            },
            "time_series_split": {
                "method": "sequence_aware",  # Use sequence-aware splitting
                # "test_size": 0.2,            # Use 20% of sequences for testing
                'target_train_fraction': 0.8,  # Aim for 80% training, 20% testing
                "debug_phases": True         # Enable detailed phase debugging
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="pad",
        debug=True,
        graphs_output_dir="./plots"
    )

    # Preprocess training data
    X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, _ = pad_pct_preprocessor.final_preprocessing(data)

    print(f"Train shapes - X: {X_train_seq.shape}, y: {y_train_seq.shape}")
    print(f"Test shapes - X: {X_test_seq.shape}, y: {y_test_seq.shape}")

    # Train model
    print("Training LSTM model with pad mode and percentage-based split...")
    horizon = get_horizon_from_preprocessor(pad_pct_preprocessor)
    print(f"Using horizon of {horizon} for model output dimension")

    model5 = build_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]), horizon=horizon)
    model5.fit(
        X_train_seq, y_train_seq, 
        validation_data=(X_test_seq, y_test_seq),
        epochs=10, batch_size=32, verbose=1
    )
    model5.save('./transformers/model_pad_pct.h5')

    # Test prediction
    print("\nTesting prediction mode with new data...")
    new_data = data.iloc[-48:].copy()

    pad_pct_predict = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="predict",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "time_series_split": {
                "method": "sequence_aware",  # Use sequence-aware splitting
                # "test_size": 0.2,            # Use 20% of sequences for testing
                'target_train_fraction': 0.8,  # Aim for 80% training, 20% testing
                "debug_phases": True         # Enable detailed phase debugging
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="pad",
        transformers_dir="./transformers"
    )

    X_new_preprocessed, _, _ = pad_pct_predict.final_preprocessing(new_data)
    model5 = load_model('./transformers/model_pad_pct.h5')
    predictions = model5.predict(X_new_preprocessed)
    print(f"Prediction results shape: {predictions.shape}")

    # Test 6: Pad Mode with Date-Based Sequence-Aware Split
    print("\n\n=== Test 6: Pad Mode with Date-Based Sequence-Aware Split ===")

    # Clean transformers directory
    shutil.rmtree('./transformers', ignore_errors=True)
    os.makedirs('./transformers', exist_ok=True)

    # Calculate median date for splitting
    median_date = data['biomech_datetime'].median()
    print(f"Using median date as split point: {median_date}")

    # Configure preprocessor for training with pad mode and date-based split
    pad_date_preprocessor = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="train",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "horizon": 10,
            "step_size": 1,
            "sequence_modes": {
                "pad": {
                    "padding_side": "post"
                }
            },
            "time_series_split": {
                "method": "sequence_aware",   # Use sequence-aware splitting
                "split_date": str(median_date), # Split at the median date
                "debug_phases": True          # Enable detailed phase debugging
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="pad",
        debug=True,
        graphs_output_dir="./plots"
    )

    # Analyze potential split points first
    print("Analyzing potential split points...")
    split_options = pad_date_preprocessor.analyze_split_options(data)
    for i, option in enumerate(split_options[:3]):  # Show top 3
        print(f"Option {i+1}: Split at {option['split_time']} - Train fraction: {option['train_fraction']:.2f}")

    # Preprocess training data
    X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, _ = pad_date_preprocessor.final_preprocessing(data)

    print(f"Train shapes - X: {X_train_seq.shape}, y: {y_train_seq.shape}")
    print(f"Test shapes - X: {X_test_seq.shape}, y: {y_test_seq.shape}")

    # Train model
    print("Training LSTM model with pad mode and date-based split...")
    horizon = get_horizon_from_preprocessor(pad_date_preprocessor)
    print(f"Using horizon of {horizon} for model output dimension")

    model6 = build_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]), horizon=horizon)
    model6.fit(
        X_train_seq, y_train_seq, 
        validation_data=(X_test_seq, y_test_seq),
        epochs=10, batch_size=32, verbose=1
    )
    model6.save('./transformers/model_pad_date.h5')

    # Test prediction
    print("\nTesting prediction mode with new data...")
    new_data = data.iloc[-48:].copy()

    pad_date_predict = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="predict",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "time_series_split": {
                "method": "sequence_aware",   # Use sequence-aware splitting
                "split_date": str(median_date), # Split at the median date
                "debug_phases": True          # Enable detailed phase debugging
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="pad",
        transformers_dir="./transformers"
    )

    X_new_preprocessed, _, _ = pad_date_predict.final_preprocessing(new_data)
    model6 = load_model('./transformers/model_pad_date.h5')
    predictions = model6.predict(X_new_preprocessed)
    print(f"Prediction results shape: {predictions.shape}")

    # Test 7: DTW Mode with Percentage-Based Sequence-Aware Split
    print("\n\n=== Test 7: DTW Mode with Percentage-Based Sequence-Aware Split ===")

    # Clean transformers directory
    shutil.rmtree('./transformers', ignore_errors=True)
    os.makedirs('./transformers', exist_ok=True)

    # Configure preprocessor for training with DTW mode and percentage-based split
    dtw_pct_preprocessor = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="train",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "horizon": 10,
            "step_size": 1,
            "sequence_modes": {
                "dtw": {
                    "reference_sequence": "max",  # Use max length sequence as reference
                    "dtw_threshold": 0.2          # DTW threshold for sequences
                }
            },
            "time_series_split": {
                "method": "sequence_aware",  # Use sequence-aware splitting
                # "test_size": 0.2,            # Use 20% of sequences for testing
                'target_train_fraction': 0.75,  # Aim for 80% training, 20% testing
                "debug_phases": True         # Enable detailed phase debugging
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="dtw",
        debug=True,
        graphs_output_dir="./plots"
    )

    # Preprocess training data
    X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, _ = dtw_pct_preprocessor.final_preprocessing(data)

    print(f"Train shapes - X: {X_train_seq.shape}, y: {y_train_seq.shape}")
    print(f"Test shapes - X: {X_test_seq.shape}, y: {y_test_seq.shape}")

    # Train model
    print("Training LSTM model with DTW mode and percentage-based split...")
    horizon = get_horizon_from_preprocessor(dtw_pct_preprocessor)
    print(f"Using horizon of {horizon} for model output dimension")

    model7 = build_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]), horizon=horizon)
    model7.fit(
        X_train_seq, y_train_seq, 
        validation_data=(X_test_seq, y_test_seq),
        epochs=10, batch_size=32, verbose=1
    )
    model7.save('./transformers/model_dtw_pct.h5')

    # Test prediction
    print("\nTesting prediction mode with new data...")
    new_data = data.iloc[-48:].copy()

    dtw_pct_predict = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="predict",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "time_series_split": {
                "method": "sequence_aware",  # Use sequence-aware splitting
                # "test_size": 0.2,            # Use 20% of sequences for testing
                'target_train_fraction': 0.75,  # Aim for 80% training, 20% testing
                "debug_phases": True         # Enable detailed phase debugging
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="dtw",
        transformers_dir="./transformers"
    )

    X_new_preprocessed, _, _ = dtw_pct_predict.final_preprocessing(new_data)
    model7 = load_model('./transformers/model_dtw_pct.h5')
    predictions = model7.predict(X_new_preprocessed)
    print(f"Prediction results shape: {predictions.shape}")

    # Test 8: DTW Mode with Date-Based Sequence-Aware Split
    print("\n\n=== Test 8: DTW Mode with Date-Based Sequence-Aware Split ===")

    # Clean transformers directory
    shutil.rmtree('./transformers', ignore_errors=True)
    os.makedirs('./transformers', exist_ok=True)

    # Use a date 1/3 through the dataset for splitting
    first_date = data['biomech_datetime'].min()
    last_date = data['biomech_datetime'].max()
    date_range = (last_date - first_date).total_seconds()
    split_date = first_date + pd.Timedelta(seconds=date_range * 2/3)
    print(f"Using calculated split date: {split_date}")

    # Configure preprocessor for training with DTW mode and date-based split
    dtw_date_preprocessor = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="train",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "horizon": 10,
            "step_size": 1,
            "sequence_modes": {
                "dtw": {
                    "reference_sequence": "max",  # Use max length sequence as reference
                    "dtw_threshold": 0.2          # DTW threshold for sequences
                }
            },
            "time_series_split": {
                "method": "sequence_aware",      # Use sequence-aware splitting
                "split_date": str(split_date),   # Split at the calculated date
                "debug_phases": True             # Enable detailed phase debugging
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="dtw",
        debug=True,
        graphs_output_dir="./plots"
    )

    # Analyze potential split points first
    print("Analyzing potential split points...")
    split_options = dtw_date_preprocessor.analyze_split_options(data)
    for i, option in enumerate(split_options[:3]):  # Show top 3
        print(f"Option {i+1}: Split at {option['split_time']} - Train fraction: {option['train_fraction']:.2f}")

    # Preprocess training data
    X_train_seq, X_test_seq, y_train_seq, y_test_seq, recommendations, _ = dtw_date_preprocessor.final_preprocessing(data)

    print(f"Train shapes - X: {X_train_seq.shape}, y: {y_train_seq.shape}")
    print(f"Test shapes - X: {X_test_seq.shape}, y: {y_test_seq.shape}")

    # Train model
    print("Training LSTM model with DTW mode and date-based split...")
    horizon = get_horizon_from_preprocessor(dtw_date_preprocessor)
    print(f"Using horizon of {horizon} for model output dimension")

    model8 = build_lstm_model((X_train_seq.shape[1], X_train_seq.shape[2]), horizon=horizon)
    model8.fit(
        X_train_seq, y_train_seq, 
        validation_data=(X_test_seq, y_test_seq),
        epochs=10, batch_size=32, verbose=1
    )
    model8.save('./transformers/model_dtw_date.h5')

    # Test prediction
    print("\nTesting prediction mode with new data...")
    new_data = data.iloc[-48:].copy()

    dtw_date_predict = DataPreprocessor(
        model_type="LSTM",
        y_variable=config["features"]["y_variable"],
        ordinal_categoricals=config["features"]["ordinal_categoricals"],
        nominal_categoricals=config["features"]["nominal_categoricals"],
        numericals=config["features"]["numericals"],
        mode="predict",
        options={
            "enabled": True,
            "time_column": "biomech_datetime",
            "time_series_split": {
                "method": "sequence_aware",      # Use sequence-aware splitting
                "split_date": str(split_date),   # Split at the calculated date
                "debug_phases": True             # Enable detailed phase debugging
            }
        },
        sequence_categorical=["session_biomech", "trial_biomech"],
        sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
        time_series_sequence_mode="dtw",
        transformers_dir="./transformers"
    )

    X_new_preprocessed, _, _ = dtw_date_predict.final_preprocessing(new_data)
    model8 = load_model('./transformers/model_dtw_date.h5')
    predictions = model8.predict(X_new_preprocessed)
    print(f"Prediction results shape: {predictions.shape}")


    print("\n\nAll tests completed successfully!")




2025-03-03 19:19:13,248 [DEBUG] Initialized ordinal_categoricals: []
2025-03-03 19:19:13,248 [DEBUG] Initialized nominal_categoricals: ['athlete_name_biomech', 'athlete_traq_biomech', 'athlete_level_biomech', 'lab_biomech', 'handedness_biomech', 'pitch_phase_biomech']
2025-03-03 19:19:13,259 [DEBUG] Initialized numericals: ['Collection Length (seconds)', 'EMG 1 (mV) - FDS (81770)', 'ACC X (G) - FDS (81770)', 'ACC Y (G) - FDS (81770)', 'ACC Z (G) - FDS (81770)', 'GYRO X (deg/s) - FDS (81770)', 'GYRO Y (deg/s) - FDS (81770)', 'GYRO Z (deg/s) - FDS (81770)', 'EMG 1 (mV) - FCU (81728)', 'ACC X (G) - FCU (81728)', 'ACC Y (G) - FCU (81728)', 'ACC Z (G) - FCU (81728)', 'GYRO X (deg/s) - FCU (81728)', 'GYRO Y (deg/s) - FCU (81728)', 'GYRO Z (deg/s) - FCU (81728)', 'EMG 1 (mV) - FCR (81745)', 'shoulder_angle_x_biomech', 'shoulder_angle_y_biomech', 'shoulder_angle_z_biomech', 'elbow_angle_x_biomech', 'elbow_angle_y_biomech', 'elbow_angle_z_biomech', 'torso_angle_x_biomech', 'torso_angle_y_biomec

[          '2025-02-14 11:42:22', '2025-02-14 11:42:22.004978048',
 '2025-02-14 11:42:22.009956097', '2025-02-14 11:42:22.014934145',
 '2025-02-14 11:42:22.019912193', '2025-02-14 11:42:22.024890242',
 '2025-02-14 11:42:22.029868290', '2025-02-14 11:42:22.034846338',
 '2025-02-14 11:42:22.039824387', '2025-02-14 11:42:22.044802435',
 ...
 '2025-02-14 11:53:32.592890066', '2025-02-14 11:53:32.597868115',
 '2025-02-14 11:53:32.602846163', '2025-02-14 11:53:32.607824211',
 '2025-02-14 11:53:32.612802260', '2025-02-14 11:53:32.617780308',
 '2025-02-14 11:53:32.622758356', '2025-02-14 11:53:32.627736405',
 '2025-02-14 11:53:32.632714453', '2025-02-14 11:53:32.637692501']
Length: 134720, dtype: datetime64[ns]

Dataset columns:
- EMG 1 (mV) - FDS (81770)
- ACC X (G) - FDS (81770)
- ACC Y (G) - FDS (81770)
- ACC Z (G) - FDS (81770)
- GYRO X (deg/s) - FDS (81770)
- GYRO Y (deg/s) - FDS (81770)
- GYRO Z (deg/s) - FDS (81770)
- EMG 1 (mV) - FCU (81728)
- ACC X (G) - FCU (81728)
- ACC Y (G) - FCU 

2025-03-03 19:19:13,323 [INFO] Applying median-based outlier replacement for time series
2025-03-03 19:19:13,330 [DEBUG] Replaced 0 outliers in column 'Collection Length (seconds)' using median method.
2025-03-03 19:19:13,332 [DEBUG] Replaced 306 outliers in column 'EMG 1 (mV) - FDS (81770)' using median method.
2025-03-03 19:19:13,344 [DEBUG] Replaced 160 outliers in column 'ACC X (G) - FDS (81770)' using median method.
2025-03-03 19:19:13,354 [DEBUG] Replaced 261 outliers in column 'ACC Y (G) - FDS (81770)' using median method.
2025-03-03 19:19:13,363 [DEBUG] Replaced 222 outliers in column 'ACC Z (G) - FDS (81770)' using median method.
2025-03-03 19:19:13,373 [DEBUG] Replaced 135 outliers in column 'GYRO X (deg/s) - FDS (81770)' using median method.
2025-03-03 19:19:13,387 [DEBUG] Replaced 123 outliers in column 'GYRO Y (deg/s) - FDS (81770)' using median method.
2025-03-03 19:19:13,400 [DEBUG] Replaced 96 outliers in column 'GYRO Z (deg/s) - FDS (81770)' using median method.
2025-0

Train shapes - X: (17, 379, 50), y: (17, 10)
Test shapes - X: (4, 379, 50), y: (4, 10)
Training LSTM model with DTW/Pad mode...
Using horizon of 10 for model output dimension
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


2025-03-03 19:19:20,759 [INFO] Starting: Final Preprocessing Pipeline in 'predict' mode.
2025-03-03 19:19:20,759 [INFO] Step: filter_columns
2025-03-03 19:19:20,763 [INFO] ✅ Filtered DataFrame to include only specified features. Shape: (48, 59)
2025-03-03 19:19:20,763 [INFO] ✅ Column filtering completed successfully.
2025-03-03 19:19:20,764 [INFO] Preprocessing time series data for prediction
2025-03-03 19:19:20,765 [INFO] Step: Load Transformers
2025-03-03 19:19:20,767 [INFO] Transformers loaded successfully from './transformers\transformers.pkl'.
2025-03-03 19:19:20,768 [INFO] Step: Handle Missing Values
2025-03-03 19:19:20,795 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:20,796 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:20,796 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 


Testing prediction mode with new data...


2025-03-03 19:19:21,653 [DEBUG] Initialized ordinal_categoricals: []
2025-03-03 19:19:21,653 [DEBUG] Initialized nominal_categoricals: ['athlete_name_biomech', 'athlete_traq_biomech', 'athlete_level_biomech', 'lab_biomech', 'handedness_biomech', 'pitch_phase_biomech']
2025-03-03 19:19:21,653 [DEBUG] Initialized numericals: ['Collection Length (seconds)', 'EMG 1 (mV) - FDS (81770)', 'ACC X (G) - FDS (81770)', 'ACC Y (G) - FDS (81770)', 'ACC Z (G) - FDS (81770)', 'GYRO X (deg/s) - FDS (81770)', 'GYRO Y (deg/s) - FDS (81770)', 'GYRO Z (deg/s) - FDS (81770)', 'EMG 1 (mV) - FCU (81728)', 'ACC X (G) - FCU (81728)', 'ACC Y (G) - FCU (81728)', 'ACC Z (G) - FCU (81728)', 'GYRO X (deg/s) - FCU (81728)', 'GYRO Y (deg/s) - FCU (81728)', 'GYRO Z (deg/s) - FCU (81728)', 'EMG 1 (mV) - FCR (81745)', 'shoulder_angle_x_biomech', 'shoulder_angle_y_biomech', 'shoulder_angle_z_biomech', 'elbow_angle_x_biomech', 'elbow_angle_y_biomech', 'elbow_angle_z_biomech', 'torso_angle_x_biomech', 'torso_angle_y_biomec

Prediction results shape: (1, 10)


All tests completed successfully!


2025-03-03 19:19:21,834 [DEBUG] Replaced 77 outliers in column 'GYRO Y (deg/s) - FCU (81728)' using median method.
2025-03-03 19:19:21,842 [DEBUG] Replaced 75 outliers in column 'GYRO Z (deg/s) - FCU (81728)' using median method.
2025-03-03 19:19:21,849 [DEBUG] Replaced 120 outliers in column 'EMG 1 (mV) - FCR (81745)' using median method.
2025-03-03 19:19:21,863 [DEBUG] Replaced 6 outliers in column 'shoulder_angle_x_biomech' using median method.
2025-03-03 19:19:21,870 [DEBUG] Replaced 11 outliers in column 'shoulder_angle_y_biomech' using median method.
2025-03-03 19:19:21,877 [DEBUG] Replaced 13 outliers in column 'shoulder_angle_z_biomech' using median method.
2025-03-03 19:19:21,886 [DEBUG] Replaced 10 outliers in column 'elbow_angle_x_biomech' using median method.
2025-03-03 19:19:21,894 [DEBUG] Replaced 36 outliers in column 'elbow_angle_y_biomech' using median method.
2025-03-03 19:19:21,904 [DEBUG] Replaced 7 outliers in column 'elbow_angle_z_biomech' using median method.
202

Train shapes - X: (17, 379, 50), y: (17, 10)
Test shapes - X: (4, 379, 50), y: (4, 10)
Training LSTM model with DTW/Pad mode...
Using horizon of 10 for model output dimension
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


2025-03-03 19:19:31,999 [INFO] Starting: Final Preprocessing Pipeline in 'predict' mode.
2025-03-03 19:19:32,000 [INFO] Step: filter_columns
2025-03-03 19:19:32,003 [INFO] ✅ Filtered DataFrame to include only specified features. Shape: (48, 59)
2025-03-03 19:19:32,006 [INFO] ✅ Column filtering completed successfully.
2025-03-03 19:19:32,006 [INFO] Preprocessing time series data for prediction
2025-03-03 19:19:32,006 [INFO] Step: Load Transformers
2025-03-03 19:19:32,015 [INFO] Transformers loaded successfully from './transformers\transformers.pkl'.
2025-03-03 19:19:32,016 [INFO] Step: Handle Missing Values
2025-03-03 19:19:32,053 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:32,054 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:32,055 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 


Testing prediction mode with new data...


2025-03-03 19:19:32,472 [DEBUG] Initialized ordinal_categoricals: []
2025-03-03 19:19:32,473 [DEBUG] Initialized nominal_categoricals: ['athlete_name_biomech', 'athlete_traq_biomech', 'athlete_level_biomech', 'lab_biomech', 'handedness_biomech', 'pitch_phase_biomech']
2025-03-03 19:19:32,473 [DEBUG] Initialized numericals: ['Collection Length (seconds)', 'EMG 1 (mV) - FDS (81770)', 'ACC X (G) - FDS (81770)', 'ACC Y (G) - FDS (81770)', 'ACC Z (G) - FDS (81770)', 'GYRO X (deg/s) - FDS (81770)', 'GYRO Y (deg/s) - FDS (81770)', 'GYRO Z (deg/s) - FDS (81770)', 'EMG 1 (mV) - FCU (81728)', 'ACC X (G) - FCU (81728)', 'ACC Y (G) - FCU (81728)', 'ACC Z (G) - FCU (81728)', 'GYRO X (deg/s) - FCU (81728)', 'GYRO Y (deg/s) - FCU (81728)', 'GYRO Z (deg/s) - FCU (81728)', 'EMG 1 (mV) - FCR (81745)', 'shoulder_angle_x_biomech', 'shoulder_angle_y_biomech', 'shoulder_angle_z_biomech', 'elbow_angle_x_biomech', 'elbow_angle_y_biomech', 'elbow_angle_z_biomech', 'torso_angle_x_biomech', 'torso_angle_y_biomec

Prediction results shape: (1, 10)


All tests completed successfully!


=== Test 5: Pad Mode with Percentage-Based Sequence-Aware Split ===


2025-03-03 19:19:32,644 [DEBUG] Replaced 77 outliers in column 'GYRO Y (deg/s) - FCU (81728)' using median method.
2025-03-03 19:19:32,652 [DEBUG] Replaced 75 outliers in column 'GYRO Z (deg/s) - FCU (81728)' using median method.
2025-03-03 19:19:32,658 [DEBUG] Replaced 120 outliers in column 'EMG 1 (mV) - FCR (81745)' using median method.
2025-03-03 19:19:32,668 [DEBUG] Replaced 6 outliers in column 'shoulder_angle_x_biomech' using median method.
2025-03-03 19:19:32,676 [DEBUG] Replaced 11 outliers in column 'shoulder_angle_y_biomech' using median method.
2025-03-03 19:19:32,685 [DEBUG] Replaced 13 outliers in column 'shoulder_angle_z_biomech' using median method.
2025-03-03 19:19:32,694 [DEBUG] Replaced 10 outliers in column 'elbow_angle_x_biomech' using median method.
2025-03-03 19:19:32,699 [DEBUG] Replaced 36 outliers in column 'elbow_angle_y_biomech' using median method.
2025-03-03 19:19:32,709 [DEBUG] Replaced 7 outliers in column 'elbow_angle_z_biomech' using median method.
202

Train shapes - X: (17, 379, 50), y: (17, 10)
Test shapes - X: (5, 379, 50), y: (5, 10)
Training LSTM model with pad mode and percentage-based split...
Using horizon of 10 for model output dimension
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Testing prediction mode with new data...


2025-03-03 19:19:39,740 [INFO] Starting: Final Preprocessing Pipeline in 'predict' mode.
2025-03-03 19:19:39,741 [INFO] Step: filter_columns
2025-03-03 19:19:39,741 [INFO] ✅ Filtered DataFrame to include only specified features. Shape: (48, 59)
2025-03-03 19:19:39,741 [INFO] ✅ Column filtering completed successfully.
2025-03-03 19:19:39,746 [INFO] Preprocessing time series data for prediction
2025-03-03 19:19:39,746 [INFO] Step: Load Transformers
2025-03-03 19:19:39,749 [INFO] Transformers loaded successfully from './transformers\transformers.pkl'.
2025-03-03 19:19:39,749 [INFO] Step: Handle Missing Values
2025-03-03 19:19:39,775 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:39,781 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:39,781 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 



2025-03-03 19:19:40,632 [DEBUG] Initialized ordinal_categoricals: []
2025-03-03 19:19:40,634 [DEBUG] Initialized nominal_categoricals: ['athlete_name_biomech', 'athlete_traq_biomech', 'athlete_level_biomech', 'lab_biomech', 'handedness_biomech', 'pitch_phase_biomech']
2025-03-03 19:19:40,636 [DEBUG] Initialized numericals: ['Collection Length (seconds)', 'EMG 1 (mV) - FDS (81770)', 'ACC X (G) - FDS (81770)', 'ACC Y (G) - FDS (81770)', 'ACC Z (G) - FDS (81770)', 'GYRO X (deg/s) - FDS (81770)', 'GYRO Y (deg/s) - FDS (81770)', 'GYRO Z (deg/s) - FDS (81770)', 'EMG 1 (mV) - FCU (81728)', 'ACC X (G) - FCU (81728)', 'ACC Y (G) - FCU (81728)', 'ACC Z (G) - FCU (81728)', 'GYRO X (deg/s) - FCU (81728)', 'GYRO Y (deg/s) - FCU (81728)', 'GYRO Z (deg/s) - FCU (81728)', 'EMG 1 (mV) - FCR (81745)', 'shoulder_angle_x_biomech', 'shoulder_angle_y_biomech', 'shoulder_angle_z_biomech', 'elbow_angle_x_biomech', 'elbow_angle_y_biomech', 'elbow_angle_z_biomech', 'torso_angle_x_biomech', 'torso_angle_y_biomec

Prediction results shape: (1, 10)


=== Test 6: Pad Mode with Date-Based Sequence-Aware Split ===
Using median date as split point: 2025-02-14 11:47:52.067005440
Analyzing potential split points...
Option 1: Split at 2025-02-14 11:42:56.010026167 - Train fraction: 0.05
Option 2: Split at 2025-02-14 11:43:25.007157669 - Train fraction: 0.09
Option 3: Split at 2025-02-14 11:43:43.007780418 - Train fraction: 0.14


2025-03-03 19:19:40,818 [DEBUG] Replaced 222 outliers in column 'ACC Z (G) - FDS (81770)' using median method.
2025-03-03 19:19:40,828 [DEBUG] Replaced 135 outliers in column 'GYRO X (deg/s) - FDS (81770)' using median method.
2025-03-03 19:19:40,836 [DEBUG] Replaced 123 outliers in column 'GYRO Y (deg/s) - FDS (81770)' using median method.
2025-03-03 19:19:40,845 [DEBUG] Replaced 96 outliers in column 'GYRO Z (deg/s) - FDS (81770)' using median method.
2025-03-03 19:19:40,851 [DEBUG] Replaced 347 outliers in column 'EMG 1 (mV) - FCU (81728)' using median method.
2025-03-03 19:19:40,862 [DEBUG] Replaced 127 outliers in column 'ACC X (G) - FCU (81728)' using median method.
2025-03-03 19:19:40,870 [DEBUG] Replaced 176 outliers in column 'ACC Y (G) - FCU (81728)' using median method.
2025-03-03 19:19:40,878 [DEBUG] Replaced 257 outliers in column 'ACC Z (G) - FCU (81728)' using median method.
2025-03-03 19:19:40,885 [DEBUG] Replaced 133 outliers in column 'GYRO X (deg/s) - FCU (81728)' us

Train shapes - X: (11, 354, 50), y: (11, 10)
Test shapes - X: (11, 354, 50), y: (11, 10)
Training LSTM model with pad mode and date-based split...
Using horizon of 10 for model output dimension
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


2025-03-03 19:19:50,106 [INFO] Starting: Final Preprocessing Pipeline in 'predict' mode.
2025-03-03 19:19:50,106 [INFO] Step: filter_columns
2025-03-03 19:19:50,110 [INFO] ✅ Filtered DataFrame to include only specified features. Shape: (48, 59)
2025-03-03 19:19:50,111 [INFO] ✅ Column filtering completed successfully.
2025-03-03 19:19:50,112 [INFO] Preprocessing time series data for prediction
2025-03-03 19:19:50,112 [INFO] Step: Load Transformers
2025-03-03 19:19:50,117 [INFO] Transformers loaded successfully from './transformers\transformers.pkl'.
2025-03-03 19:19:50,117 [INFO] Step: Handle Missing Values
2025-03-03 19:19:50,164 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:50,165 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:50,167 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 


Testing prediction mode with new data...


2025-03-03 19:19:50,184 [INFO] Created placeholder for missing 'windup' with shape (185, 50)
2025-03-03 19:19:50,185 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:50,187 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:50,189 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:19:50,196 [INFO] ✅ Preprocessing completed successfully in time series predict mode.




2025-03-03 19:19:51,733 [DEBUG] Initialized ordinal_categoricals: []
2025-03-03 19:19:51,733 [DEBUG] Initialized nominal_categoricals: ['athlete_name_biomech', 'athlete_traq_biomech', 'athlete_level_biomech', 'lab_biomech', 'handedness_biomech', 'pitch_phase_biomech']
2025-03-03 19:19:51,733 [DEBUG] Initialized numericals: ['Collection Length (seconds)', 'EMG 1 (mV) - FDS (81770)', 'ACC X (G) - FDS (81770)', 'ACC Y (G) - FDS (81770)', 'ACC Z (G) - FDS (81770)', 'GYRO X (deg/s) - FDS (81770)', 'GYRO Y (deg/s) - FDS (81770)', 'GYRO Z (deg/s) - FDS (81770)', 'EMG 1 (mV) - FCU (81728)', 'ACC X (G) - FCU (81728)', 'ACC Y (G) - FCU (81728)', 'ACC Z (G) - FCU (81728)', 'GYRO X (deg/s) - FCU (81728)', 'GYRO Y (deg/s) - FCU (81728)', 'GYRO Z (deg/s) - FCU (81728)', 'EMG 1 (mV) - FCR (81745)', 'shoulder_angle_x_biomech', 'shoulder_angle_y_biomech', 'shoulder_angle_z_biomech', 'elbow_angle_x_biomech', 'elbow_angle_y_biomech', 'elbow_angle_z_biomech', 'torso_angle_x_biomech', 'torso_angle_y_biomec

Prediction results shape: (1, 10)


=== Test 7: DTW Mode with Percentage-Based Sequence-Aware Split ===


2025-03-03 19:19:51,946 [DEBUG] Replaced 160 outliers in column 'ACC X (G) - FDS (81770)' using median method.
2025-03-03 19:19:51,976 [DEBUG] Replaced 261 outliers in column 'ACC Y (G) - FDS (81770)' using median method.
2025-03-03 19:19:51,995 [DEBUG] Replaced 222 outliers in column 'ACC Z (G) - FDS (81770)' using median method.
2025-03-03 19:19:52,015 [DEBUG] Replaced 135 outliers in column 'GYRO X (deg/s) - FDS (81770)' using median method.
2025-03-03 19:19:52,034 [DEBUG] Replaced 123 outliers in column 'GYRO Y (deg/s) - FDS (81770)' using median method.
2025-03-03 19:19:52,049 [DEBUG] Replaced 96 outliers in column 'GYRO Z (deg/s) - FDS (81770)' using median method.
2025-03-03 19:19:52,064 [DEBUG] Replaced 347 outliers in column 'EMG 1 (mV) - FCU (81728)' using median method.
2025-03-03 19:19:52,083 [DEBUG] Replaced 127 outliers in column 'ACC X (G) - FCU (81728)' using median method.
2025-03-03 19:19:52,098 [DEBUG] Replaced 176 outliers in column 'ACC Y (G) - FCU (81728)' using m

Train shapes - X: (15, 364, 50), y: (15, 10)
Test shapes - X: (7, 364, 50), y: (7, 10)
Training LSTM model with DTW mode and percentage-based split...
Using horizon of 10 for model output dimension
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


2025-03-03 19:20:05,379 [INFO] Starting: Final Preprocessing Pipeline in 'predict' mode.
2025-03-03 19:20:05,380 [INFO] Step: filter_columns
2025-03-03 19:20:05,384 [INFO] ✅ Filtered DataFrame to include only specified features. Shape: (48, 59)
2025-03-03 19:20:05,384 [INFO] ✅ Column filtering completed successfully.
2025-03-03 19:20:05,385 [INFO] Preprocessing time series data for prediction
2025-03-03 19:20:05,386 [INFO] Step: Load Transformers
2025-03-03 19:20:05,391 [INFO] Transformers loaded successfully from './transformers\transformers.pkl'.
2025-03-03 19:20:05,393 [INFO] Step: Handle Missing Values
2025-03-03 19:20:05,431 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:20:05,432 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:20:05,433 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 


Testing prediction mode with new data...


2025-03-03 19:20:06,419 [DEBUG] Initialized ordinal_categoricals: []
2025-03-03 19:20:06,419 [DEBUG] Initialized nominal_categoricals: ['athlete_name_biomech', 'athlete_traq_biomech', 'athlete_level_biomech', 'lab_biomech', 'handedness_biomech', 'pitch_phase_biomech']
2025-03-03 19:20:06,420 [DEBUG] Initialized numericals: ['Collection Length (seconds)', 'EMG 1 (mV) - FDS (81770)', 'ACC X (G) - FDS (81770)', 'ACC Y (G) - FDS (81770)', 'ACC Z (G) - FDS (81770)', 'GYRO X (deg/s) - FDS (81770)', 'GYRO Y (deg/s) - FDS (81770)', 'GYRO Z (deg/s) - FDS (81770)', 'EMG 1 (mV) - FCU (81728)', 'ACC X (G) - FCU (81728)', 'ACC Y (G) - FCU (81728)', 'ACC Z (G) - FCU (81728)', 'GYRO X (deg/s) - FCU (81728)', 'GYRO Y (deg/s) - FCU (81728)', 'GYRO Z (deg/s) - FCU (81728)', 'EMG 1 (mV) - FCR (81745)', 'shoulder_angle_x_biomech', 'shoulder_angle_y_biomech', 'shoulder_angle_z_biomech', 'elbow_angle_x_biomech', 'elbow_angle_y_biomech', 'elbow_angle_z_biomech', 'torso_angle_x_biomech', 'torso_angle_y_biomec

Prediction results shape: (1, 10)


=== Test 8: DTW Mode with Date-Based Sequence-Aware Split ===
Using calculated split date: 2025-02-14 11:49:59.067803500
Analyzing potential split points...
Option 1: Split at 2025-02-14 11:42:56.010026167 - Train fraction: 0.05
Option 2: Split at 2025-02-14 11:43:25.007157669 - Train fraction: 0.09
Option 3: Split at 2025-02-14 11:43:43.007780418 - Train fraction: 0.14


2025-03-03 19:20:06,603 [DEBUG] Replaced 160 outliers in column 'ACC X (G) - FDS (81770)' using median method.
2025-03-03 19:20:06,616 [DEBUG] Replaced 261 outliers in column 'ACC Y (G) - FDS (81770)' using median method.
2025-03-03 19:20:06,630 [DEBUG] Replaced 222 outliers in column 'ACC Z (G) - FDS (81770)' using median method.
2025-03-03 19:20:06,641 [DEBUG] Replaced 135 outliers in column 'GYRO X (deg/s) - FDS (81770)' using median method.
2025-03-03 19:20:06,651 [DEBUG] Replaced 123 outliers in column 'GYRO Y (deg/s) - FDS (81770)' using median method.
2025-03-03 19:20:06,662 [DEBUG] Replaced 96 outliers in column 'GYRO Z (deg/s) - FDS (81770)' using median method.
2025-03-03 19:20:06,673 [DEBUG] Replaced 347 outliers in column 'EMG 1 (mV) - FCU (81728)' using median method.
2025-03-03 19:20:06,683 [DEBUG] Replaced 127 outliers in column 'ACC X (G) - FCU (81728)' using median method.
2025-03-03 19:20:06,694 [DEBUG] Replaced 176 outliers in column 'ACC Y (G) - FCU (81728)' using m

Train shapes - X: (17, 379, 50), y: (17, 10)
Test shapes - X: (5, 379, 50), y: (5, 10)
Training LSTM model with DTW mode and date-based split...
Using horizon of 10 for model output dimension
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


2025-03-03 19:20:19,123 [INFO] Starting: Final Preprocessing Pipeline in 'predict' mode.
2025-03-03 19:20:19,124 [INFO] Step: filter_columns
2025-03-03 19:20:19,128 [INFO] ✅ Filtered DataFrame to include only specified features. Shape: (48, 59)
2025-03-03 19:20:19,129 [INFO] ✅ Column filtering completed successfully.
2025-03-03 19:20:19,130 [INFO] Preprocessing time series data for prediction
2025-03-03 19:20:19,131 [INFO] Step: Load Transformers
2025-03-03 19:20:19,134 [INFO] Transformers loaded successfully from './transformers\transformers.pkl'.
2025-03-03 19:20:19,134 [INFO] Step: Handle Missing Values
2025-03-03 19:20:19,170 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:20:19,171 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 19:20:19,172 [INFO] Using detected phases for prediction: ['arm_acceleration', 'arm_cocking', 'stride', 'windup']
2025-03-03 


Testing prediction mode with new data...
Prediction results shape: (1, 10)


All tests completed successfully!


Notes:
dtw is doing best overall but we should add a report to show it vs pad vs set_window

tcn is good with bidirectional not with unidirectional, uni is good by itself. Add this to the report as well. 


In [None]:
# ... existing imports ...
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import json
from datetime import datetime
import os
import tensorflow as tf

def run_sequence_mode_experiment(data, config, sequence_mode, model_architectures):
    """
    Run experiment for a specific sequence mode with different model architectures.
    
    Args:
        data (pd.DataFrame): Input data.
        config (dict): Configuration dictionary.
        sequence_mode (str): One of ["set_window", "dtw", "pad", "variable_length"].
        model_architectures (list): List of dicts containing model architecture settings.
    
    Returns:
        dict: A dictionary of experiment results.
    """
    results = {}
    
    # Update ts_params with time-series configuration for the experiment.
    # For each mode, we want to:
    #   - Always include the time_column.
    #   - For set_window mode: include window_size and step_size.
    #   - For DTW/pad modes: include max_sequence_length.
    #   - For all modes, we set horizon and the time_series_sequence_mode.
    ts_params = {
        "enabled": True,
        "time_column": "ongoing_timestamp_biomech",
        "window_size": 500,          # Only used for set_window mode.
        "horizon": 100,
        "step_size": 1,              # Only used for set_window mode.
        "max_sequence_length": 500,  # Used in pad mode.
        "time_series_sequence_mode": sequence_mode,
        # Optionally, you can also add a pad_threshold here if needed, e.g.:
        # "pad_threshold": 0.5,
        "handle_outliers": {
            "time_series_method": "none",  # Use "none" to disable custom time-series outlier handling.
        }
    }
    
    print(f"\n{'='*80}")
    print(f"Testing sequence mode: {sequence_mode}")
    print(f"{'='*80}")
    
    try:
        # Create preprocessor with current sequence mode and the appropriate parameters.
        preprocessor = DataPreprocessor(
            model_type="LSTM",  # Assume an LSTM-based model for time series.
            y_variable=config["features"]["y_variable"],
            ordinal_categoricals=config["features"]["ordinal_categoricals"],
            nominal_categoricals=config["features"]["nominal_categoricals"],
            numericals=config["features"]["numericals"],
            mode="train",
            options=ts_params,
            debug=True,
            graphs_output_dir=config["paths"]["plots_output_dir"],
            transformers_dir=config["paths"]["transformers_save_base_dir"],
            time_column=ts_params.get("time_column"),
            window_size=ts_params.get("window_size"), # set_window argument
            horizon=ts_params.get("horizon"),
            step_size=ts_params.get("step_size"), # set_window argument
            max_sequence_length=ts_params.get("max_sequence_length"), # set_window argument
            sequence_categorical=["session_biomech", "trial_biomech"],
            sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
            time_series_sequence_mode=sequence_mode
        )
        
        # Preprocess the data using the final_preprocessing method.
        # This returns:
        #   X_seq: The processed time-series sequences (as a NumPy array).
        #   y_seq: The aligned target sequences.
        #   recommendations: Preprocessing recommendations.
        X_seq, _, y_seq, _, recommendations, _ = preprocessor.final_preprocessing(data)
        
        # Test each provided model architecture.
        for arch in model_architectures:
            arch_name = f"{'TCN-' if arch['use_tcn'] else ''}{'Bi' if arch['bidirectional'] else ''}LSTM"
            print(f"\nTesting architecture: {arch_name}")
            
            try:
                # Train the model (train_lstm_model should be defined elsewhere)
                model = train_lstm_model(
                    X_seq, 
                    y_seq, 
                    ts_params, 
                    config, 
                    use_tcn=arch['use_tcn'],
                    bidirectional=arch['bidirectional']
                )
                
                # Evaluate the model using the same data (for demonstration).
                eval_metrics = model.evaluate(X_seq, y_seq, verbose=0)
                metric_names = ['loss', 'mae', 'rmse', 'r2', 'mape']
                
                # Store results in the dictionary.
                results[arch_name] = {
                    'metrics': dict(zip(metric_names, eval_metrics)),
                    'sequence_shape': X_seq.shape,
                    'architecture': arch
                }
                
                # Clean up model and clear session.
                del model
                tf.keras.backend.clear_session()
                
            except Exception as e:
                print(f"Error with architecture {arch_name}: {str(e)}")
                results[arch_name] = {'error': str(e)}
        
    except Exception as e:
        print(f"Error with sequence mode {sequence_mode}: {str(e)}")
        results['preprocessing_error'] = str(e)
    
    return results

def save_experiment_results(results, config):
    """Save experiment results to a JSON file."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"experiment_results_{timestamp}.json"
    filepath = os.path.join(config["paths"]["training_output_dir"], filename)
    
    def convert_to_serializable(obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        return obj
    
    serializable_results = json.loads(json.dumps(results, default=convert_to_serializable))
    
    with open(filepath, 'w') as f:
        json.dump(serializable_results, f, indent=2)
    
    print(f"Results saved to {filepath}")

def print_experiment_summary(all_results):
    """Print a summary of all experiment results."""
    print("\n" + "="*100)
    print("EXPERIMENT SUMMARY")
    print("="*100)
    
    for sequence_mode, results in all_results.items():
        print(f"\nSequence Mode: {sequence_mode}")
        print("-" * 80)
        
        if 'preprocessing_error' in results:
            print(f"ERROR: {results['preprocessing_error']}")
            continue
            
        for arch_name, arch_results in results.items():
            if 'error' in arch_results:
                print(f"{arch_name}: ERROR - {arch_results['error']}")
                continue
                
            metrics = arch_results['metrics']
            print(f"\n{arch_name}:")
            print(f"  Sequence Shape: {arch_results['sequence_shape']}")
            print(f"  MAE: {metrics['mae']:.4f}")
            print(f"  RMSE: {metrics['rmse']:.4f}")
            print(f"  R²: {metrics['r2']:.4f}")
            print(f"  MAPE: {metrics['mape']:.4f}")

# Define model architectures to test
model_architectures = [
    # {'use_tcn': False, 'bidirectional': False},  # LSTM
    # {'use_tcn': False, 'bidirectional': True},   # BiLSTM
    # {'use_tcn': True, 'bidirectional': False},   # TCN-LSTM
    {'use_tcn': True, 'bidirectional': True}     # TCN-BiLSTM
]

# Sequence modes to test
sequence_modes = ["set_window", "dtw", "pad", "variable_length"]

# Run experiments for each sequence mode and collect results.
all_results = {}
for mode in sequence_modes:
    all_results[mode] = run_sequence_mode_experiment(data, config, mode, model_architectures)

# Print summary of all experiments.
print_experiment_summary(all_results)


NameError: name 'data' is not defined

In [None]:
# # Implementing GASF-DCAE-DNN Model for Injury Prediction

# The GASF-DCAE-DNN (Gramian Angular Summation Field-Deep Convolutional Auto-Encoder-Deep Neural Network) model represents a significant advancement in sports injury prediction. Below is a comprehensive implementation guide to integrate this innovative approach into your existing experimental framework.

# ## Implementation Overview

# The implementation requires several key components:
# 1. Time series to image encoding using GASF
# 2. A deep convolutional autoencoder for feature learning
# 3. A deep neural network classifier for final prediction
# 4. Integration with your existing experimental framework

# ```python
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, Dense, Flatten, Dropout, Reshape
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix
from pyts.image import GramianAngularField
import matplotlib.pyplot as plt
# ```

# ## Step 1: Time Series to GASF Image Transformation

# First, we need to add a function that transforms time series data into GASF images:

# ```python
def create_gasf_images(X_seq, image_size=64):
    """
    Transform time series data into Gramian Angular Summation Field (GASF) images.
    
    Args:
        X_seq (np.ndarray): Time series data of shape (n_samples, sequence_length, n_features)
        image_size (int): Size of the resulting square images
        
    Returns:
        np.ndarray: GASF images of shape (n_samples, n_features, image_size, image_size)
    """
    n_samples, seq_length, n_features = X_seq.shape
    
    # Initialize transformer
    gasf = GramianAngularField(image_size=image_size, method='summation')
    
    # Initialize the result array
    result = np.zeros((n_samples, n_features, image_size, image_size))
    
    # Transform each feature for each sample
    for i in range(n_samples):
        for j in range(n_features):
            # Extract the time series for this feature
            time_series = X_seq[i, :, j]
            
            # Reshape for pyts which expects shape (n_samples, n_timepoints)
            time_series = time_series.reshape(1, -1)
            
            # Transform to GASF
            gasf_image = gasf.fit_transform(time_series)
            
            # Store the result
            result[i, j] = gasf_image[0]
    
    return result
# ```

# ## Step 2: Deep Convolutional Auto-Encoder (DCAE)

# Next, we implement the DCAE model to extract features from the GASF images:

# ```python
def build_dcae_model(input_shape, latent_dim=32):
    """
    Build a Deep Convolutional Auto-Encoder model.
    
    Args:
        input_shape (tuple): Shape of input images (height, width, channels)
        latent_dim (int): Dimension of the latent representation
        
    Returns:
        tuple: (full_model, encoder_model) - both are Keras models
    """
    # Input layer
    input_img = Input(shape=input_shape)
    
    # Encoder
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
    x = MaxPooling2D((2, 2), padding='same')(x)
    x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = MaxPooling2D((2, 2), padding='same')(x)
    x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
    x = MaxPooling2D((2, 2), padding='same')(x)
    
    # Flatten the output and create the latent representation
    flattened = Flatten()(x)
    encoded = Dense(latent_dim, activation='relu')(flattened)
    
    # Reshape for the decoder
    reshaped = Dense(input_shape[0]//8 * input_shape[1]//8 * 128, activation='relu')(encoded)
    reshaped = Reshape((input_shape[0]//8, input_shape[1]//8, 128))(reshaped)
    
    # Decoder
    x = Conv2D(128, (3, 3), activation='relu', padding='same')(reshaped)
    x = UpSampling2D((2, 2))(x)
    x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = UpSampling2D((2, 2))(x)
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
    x = UpSampling2D((2, 2))(x)
    decoded = Conv2D(input_shape[2], (3, 3), activation='sigmoid', padding='same')(x)
    
    # Full autoencoder model
    autoencoder = Model(input_img, decoded)
    
    # Encoder model (for extracting features)
    encoder = Model(input_img, encoded)
    
    return autoencoder, encoder
# ```

# ## Step 3: Deep Neural Network (DNN) for Injury Prediction

# Now, we implement the DNN classifier that will use the extracted features:

# ```python
def build_dnn_classifier(input_dim, hidden_layers=[64, 32]):
    """
    Build a Deep Neural Network classifier for injury prediction.
    
    Args:
        input_dim (int): Dimension of input features
        hidden_layers (list): List of hidden layer sizes
        
    Returns:
        tf.keras.Model: Compiled DNN model
    """
    # Input layer
    inputs = Input(shape=(input_dim,))
    
    # Hidden layers
    x = inputs
    for units in hidden_layers:
        x = Dense(units, activation='relu')(x)
        x = Dropout(0.3)(x)
    
    # Output layer (binary classification for injury prediction)
    outputs = Dense(1, activation='sigmoid')(x)
    
    # Create and compile the model
    model = Model(inputs, outputs)
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    return model
# ```

# ## Step 4: GASF-DCAE-DNN Complete Model

# Now, let's integrate these components into a single model class:

# ```python
class GASFDCAEDNNModel:
    """
    Complete GASF-DCAE-DNN model for injury prediction.
    """
    
    def __init__(self, image_size=64, latent_dim=32, dnn_hidden_layers=[64, 32], channels_first=True):
        """
        Initialize the GASF-DCAE-DNN model.
        
        Args:
            image_size (int): Size of GASF images
            latent_dim (int): Dimension of latent space in the autoencoder
            dnn_hidden_layers (list): List of hidden layer sizes for the DNN
            channels_first (bool): Whether the channels dimension is first in input shape
        """
        self.image_size = image_size
        self.latent_dim = latent_dim
        self.dnn_hidden_layers = dnn_hidden_layers
        self.channels_first = channels_first
        self.autoencoder = None
        self.encoder = None
        self.classifier = None
        self.history = None
        
    def preprocess(self, X_seq):
        """
        Convert time series data to GASF images.
        
        Args:
            X_seq (np.ndarray): Time series data
            
        Returns:
            np.ndarray: GASF images
        """
        gasf_images = create_gasf_images(X_seq, self.image_size)
        
        # Rearrange dimensions for TensorFlow (samples, height, width, channels)
        if self.channels_first:
            # From (samples, channels, height, width) to (samples, height, width, channels)
            gasf_images = np.transpose(gasf_images, (0, 2, 3, 1))
            
        return gasf_images
    
    def train(self, X_seq, y_seq, validation_split=0.2, epochs=50, batch_size=32):
        """
        Train the complete GASF-DCAE-DNN model.
        
        Args:
            X_seq (np.ndarray): Time series data
            y_seq (np.ndarray): Target values (injury labels)
            validation_split (float): Fraction of data to use for validation
            epochs (int): Number of training epochs
            batch_size (int): Batch size for training
        """
        print("Preprocessing time series data to GASF images...")
        X_gasf = self.preprocess(X_seq)
        
        n_samples, height, width, channels = X_gasf.shape
        input_shape = (height, width, channels)
        
        # Step 1: Train the autoencoder
        print("Building and training the autoencoder...")
        self.autoencoder, self.encoder = build_dcae_model(input_shape, self.latent_dim)
        
        self.autoencoder.compile(optimizer='adam', loss='mse')
        
        # Train the autoencoder
        autoencoder_history = self.autoencoder.fit(
            X_gasf, X_gasf,
            epochs=epochs,
            batch_size=batch_size,
            shuffle=True,
            validation_split=validation_split,
            verbose=1
        )
        
        # Step 2: Extract features using the trained encoder
        print("Extracting features using the trained encoder...")
        latent_features = self.encoder.predict(X_gasf)
        
        # Step 3: Train the DNN classifier
        print("Building and training the DNN classifier...")
        self.classifier = build_dnn_classifier(self.latent_dim, self.dnn_hidden_layers)
        
        # Train the classifier
        classifier_history = self.classifier.fit(
            latent_features, y_seq,
            epochs=epochs,
            batch_size=batch_size,
            validation_split=validation_split,
            verbose=1
        )
        
        self.history = {
            'autoencoder': autoencoder_history.history,
            'classifier': classifier_history.history
        }
        
        return self.history
    
    def predict(self, X_seq):
        """
        Make predictions using the trained model.
        
        Args:
            X_seq (np.ndarray): Time series data
            
        Returns:
            np.ndarray: Predicted injury probabilities
        """
        if self.encoder is None or self.classifier is None:
            raise ValueError("Model has not been trained yet")
            
        # Preprocess the input
        X_gasf = self.preprocess(X_seq)
        
        # Extract features
        latent_features = self.encoder.predict(X_gasf)
        
        # Make predictions
        predictions = self.classifier.predict(latent_features)
        
        return predictions
    
    def evaluate(self, X_seq, y_true):
        """
        Evaluate the model performance.
        
        Args:
            X_seq (np.ndarray): Time series data
            y_true (np.ndarray): True injury labels
            
        Returns:
            dict: Performance metrics
        """
        if self.encoder is None or self.classifier is None:
            raise ValueError("Model has not been trained yet")
            
        # Get predictions
        y_pred_prob = self.predict(X_seq).flatten()
        y_pred = (y_pred_prob > 0.5).astype(int)
        
        # Calculate metrics
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
        gmean = np.sqrt(sensitivity * specificity)
        auc = roc_auc_score(y_true, y_pred_prob)
        
        metrics = {
            'auc': auc,
            'gmean': gmean,
            'sensitivity': sensitivity,
            'specificity': specificity,
            'accuracy': (tp + tn) / (tp + tn + fp + fn)
        }
        
        return metrics
    
    def save(self, model_path):
        """
        Save the model components.
        
        Args:
            model_path (str): Base path to save model components
        """
        if self.autoencoder is None or self.encoder is None or self.classifier is None:
            raise ValueError("Model has not been trained yet")
            
        # Save components
        self.autoencoder.save(f"{model_path}_autoencoder.h5")
        self.encoder.save(f"{model_path}_encoder.h5")
        self.classifier.save(f"{model_path}_classifier.h5")
        
        # Save parameters
        params = {
            'image_size': self.image_size,
            'latent_dim': self.latent_dim,
            'dnn_hidden_layers': self.dnn_hidden_layers,
            'channels_first': self.channels_first
        }
        with open(f"{model_path}_params.json", 'w') as f:
            json.dump(params, f)
    
    @classmethod
    def load(cls, model_path):
        """
        Load a saved model.
        
        Args:
            model_path (str): Base path where model components are saved
            
        Returns:
            GASFDCAEDNNModel: Loaded model
        """
        # Load parameters
        with open(f"{model_path}_params.json", 'r') as f:
            params = json.load(f)
            
        # Create instance
        instance = cls(**params)
        
        # Load model components
        instance.autoencoder = tf.keras.models.load_model(f"{model_path}_autoencoder.h5")
        instance.encoder = tf.keras.models.load_model(f"{model_path}_encoder.h5")
        instance.classifier = tf.keras.models.load_model(f"{model_path}_classifier.h5")
        
        return instance
# ```

# ## Step 5: Integration with Your Experimental Framework

# Now, let's integrate this model into your existing experimental framework:

# ```python
def train_gasf_dcae_dnn_model(X_seq, y_seq, ts_params, config, **kwargs):
    """
    Train a GASF-DCAE-DNN model for time series prediction.
    
    Args:
        X_seq (np.ndarray): Preprocessed time series sequences
        y_seq (np.ndarray): Target values
        ts_params (dict): Time series parameters
        config (dict): Configuration dictionary
        **kwargs: Additional parameters
        
    Returns:
        GASFDCAEDNNModel: Trained model
    """
    # Extract parameters
    image_size = kwargs.get('image_size', 64)
    latent_dim = kwargs.get('latent_dim', 32)
    dnn_hidden_layers = kwargs.get('dnn_hidden_layers', [64, 32])
    epochs = kwargs.get('epochs', 50)
    batch_size = kwargs.get('batch_size', 32)
    validation_split = kwargs.get('validation_split', 0.2)
    
    # Create and train the model
    model = GASFDCAEDNNModel(
        image_size=image_size,
        latent_dim=latent_dim,
        dnn_hidden_layers=dnn_hidden_layers
    )
    
    # Convert target to binary for injury prediction
    # (Assuming y_seq contains injury labels or can be converted to them)
    if len(y_seq.shape) > 1 and y_seq.shape[1] > 1:
        # If y_seq is multi-dimensional, we need to convert it to binary labels
        # This is just an example - adjust based on your actual target structure
        y_binary = np.max(y_seq, axis=1) > 0.5
    else:
        y_binary = y_seq > 0.5
    
    # Train the model
    history = model.train(
        X_seq,
        y_binary,
        validation_split=validation_split,
        epochs=epochs,
        batch_size=batch_size
    )
    
    # Evaluate the model
    metrics = model.evaluate(X_seq, y_binary)
    print("Model performance:")
    for metric, value in metrics.items():
        print(f"  {metric}: {value:.4f}")
    
    # Save model if required
    if config["paths"].get("model_save_path"):
        model_path = os.path.join(
            config["paths"]["model_save_path"],
            f"gasf_dcae_dnn_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        )
        model.save(model_path)
    
    return model

def run_gasf_dcae_dnn_experiment(data, config, sequence_modes):
    """
    Run GASF-DCAE-DNN experiment for different sequence modes.
    
    Args:
        data (pd.DataFrame): Input data
        config (dict): Configuration dictionary
        sequence_modes (list): List of sequence modes to test
        
    Returns:
        dict: A dictionary of experiment results
    """
    all_results = {}
    
    for mode in sequence_modes:
        print(f"\n{'='*80}")
        print(f"Testing GASF-DCAE-DNN with sequence mode: {mode}")
        print(f"{'='*80}")
        
        try:
            # Update ts_params with time-series configuration for the experiment
            ts_params = {
                "enabled": True,
                "time_column": "ongoing_timestamp_biomech",
                "window_size": 500,
                "horizon": 100,
                "step_size": 1,
                "max_sequence_length": 500,
                "time_series_sequence_mode": mode,
                "handle_outliers": {
                    "time_series_method": "none",
                }
            }
            
            # Create preprocessor with current sequence mode
            preprocessor = DataPreprocessor(
                model_type="custom",  # Use custom for our special model
                y_variable=config["features"]["y_variable"],
                ordinal_categoricals=config["features"]["ordinal_categoricals"],
                nominal_categoricals=config["features"]["nominal_categoricals"],
                numericals=config["features"]["numericals"],
                mode="train",
                options=ts_params,
                debug=True,
                graphs_output_dir=config["paths"]["plots_output_dir"],
                transformers_dir=config["paths"]["transformers_save_base_dir"],
                time_column=ts_params.get("time_column"),
                window_size=ts_params.get("window_size"),
                horizon=ts_params.get("horizon"),
                step_size=ts_params.get("step_size"),
                max_sequence_length=ts_params.get("max_sequence_length"),
                sequence_categorical=["session_biomech", "trial_biomech"],
                sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
                time_series_sequence_mode=mode
            )
            
            # Preprocess the data
            X_seq, _, y_seq, _, recommendations, _ = preprocessor.final_preprocessing(data)
            
            # Train GASF-DCAE-DNN model
            model = train_gasf_dcae_dnn_model(
                X_seq,
                y_seq,
                ts_params,
                config,
                image_size=64,
                latent_dim=32,
                dnn_hidden_layers=[64, 32],
                epochs=50,
                batch_size=32
            )
            
            # Evaluate the model
            metrics = model.evaluate(X_seq, y_seq > 0.5)  # Convert to binary
            
            # Store results
            all_results[mode] = {
                'metrics': metrics,
                'sequence_shape': X_seq.shape,
                'preprocessor_recommendations': recommendations
            }
            
            # Clean up
            del model
            tf.keras.backend.clear_session()
            
        except Exception as e:
            print(f"Error with sequence mode {mode}: {str(e)}")
            all_results[mode] = {'error': str(e)}
    
    return all_results
# ```

# ## Using the GASF-DCAE-DNN Model in Your Existing Framework

# To add this model to your existing experiment framework, you need to make the following modifications:

# ```python
# Define model architectures to test
model_architectures = [
    # Your existing models
    # {'use_tcn': False, 'bidirectional': False},  # LSTM
    # {'use_tcn': False, 'bidirectional': True},   # BiLSTM
    # {'use_tcn': True, 'bidirectional': False},   # TCN-LSTM
    {'use_tcn': True, 'bidirectional': True},     # TCN-BiLSTM
    
    # Add GASF-DCAE-DNN model
    {'model_type': 'gasf_dcae_dnn', 'image_size': 64, 'latent_dim': 32}
]

# Modify run_sequence_mode_experiment to handle the GASF-DCAE-DNN model type
def run_sequence_mode_experiment(data, config, sequence_mode, model_architectures):
    """
    Modified version to include GASF-DCAE-DNN model.
    """
    results = {}
    
    # Time series parameters setup (same as before)
    ts_params = {
        "enabled": True,
        "time_column": "ongoing_timestamp_biomech",
        "window_size": 500,
        "horizon": 100,
        "step_size": 1,
        "max_sequence_length": 500,
        "time_series_sequence_mode": sequence_mode,
        "handle_outliers": {
            "time_series_method": "none",
        }
    }
    
    print(f"\n{'='*80}")
    print(f"Testing sequence mode: {sequence_mode}")
    print(f"{'='*80}")
    
    try:
        # Create preprocessor (same as before)
        preprocessor = DataPreprocessor(
            model_type="custom",  # Changed to custom for flexibility
            y_variable=config["features"]["y_variable"],
            ordinal_categoricals=config["features"]["ordinal_categoricals"],
            nominal_categoricals=config["features"]["nominal_categoricals"],
            numericals=config["features"]["numericals"],
            mode="train",
            options=ts_params,
            debug=True,
            graphs_output_dir=config["paths"]["plots_output_dir"],
            transformers_dir=config["paths"]["transformers_save_base_dir"],
            time_column=ts_params.get("time_column"),
            window_size=ts_params.get("window_size"),
            horizon=ts_params.get("horizon"),
            step_size=ts_params.get("step_size"),
            max_sequence_length=ts_params.get("max_sequence_length"),
            sequence_categorical=["session_biomech", "trial_biomech"],
            sequence_dtw_or_pad_categorical=["pitch_phase_biomech"],
            time_series_sequence_mode=sequence_mode
        )
        
        # Preprocess the data (same as before)
        X_seq, _, y_seq, _, recommendations, _ = preprocessor.final_preprocessing(data)
        
        # Test each provided model architecture
        for arch in model_architectures:
            # Check if this is a GASF-DCAE-DNN model
            if arch.get('model_type') == 'gasf_dcae_dnn':
                arch_name = "GASF-DCAE-DNN"
                print(f"\nTesting architecture: {arch_name}")
                
                try:
                    # Train the GASF-DCAE-DNN model
                    model = train_gasf_dcae_dnn_model(
                        X_seq, 
                        y_seq, 
                        ts_params, 
                        config, 
                        image_size=arch.get('image_size', 64),
                        latent_dim=arch.get('latent_dim', 32),
                        dnn_hidden_layers=arch.get('dnn_hidden_layers', [64, 32]),
                        epochs=arch.get('epochs', 50),
                        batch_size=arch.get('batch_size', 32)
                    )
                    
                    # Evaluate the model
                    metrics = model.evaluate(X_seq, y_seq > 0.5)
                    
                    # Store results in the dictionary
                    results[arch_name] = {
                        'metrics': metrics,
                        'sequence_shape': X_seq.shape,
                        'architecture': arch
                    }
                    
                    # Clean up
                    del model
                    tf.keras.backend.clear_session()
                    
                except Exception as e:
                    print(f"Error with architecture {arch_name}: {str(e)}")
                    results[arch_name] = {'error': str(e)}
            else:
                # Original LSTM/TCN handling (as in your code)
                arch_name = f"{'TCN-' if arch['use_tcn'] else ''}{'Bi' if arch['bidirectional'] else ''}LSTM"
                print(f"\nTesting architecture: {arch_name}")
                
                try:
                    # Train the model using your existing function
                    model = train_lstm_model(
                        X_seq, 
                        y_seq, 
                        ts_params, 
                        config, 
                        use_tcn=arch['use_tcn'],
                        bidirectional=arch['bidirectional']
                    )
                    
                    # Evaluate and store results as before
                    eval_metrics = model.evaluate(X_seq, y_seq, verbose=0)
                    metric_names = ['loss', 'mae', 'rmse', 'r2', 'mape']
                    
                    results[arch_name] = {
                        'metrics': dict(zip(metric_names, eval_metrics)),
                        'sequence_shape': X_seq.shape,
                        'architecture': arch
                    }
                    
                    # Clean up
                    del model
                    tf.keras.backend.clear_session()
                    
                except Exception as e:
                    print(f"Error with architecture {arch_name}: {str(e)}")
                    results[arch_name] = {'error': str(e)}
        
    except Exception as e:
        print(f"Error with sequence mode {sequence_mode}: {str(e)}")
        results['preprocessing_error'] = str(e)
    
    return results
# ```

# ## Performance and Considerations

# The GASF-DCAE-DNN model offers several advantages for injury prediction:

# 1. **Superior Pattern Recognition**: By transforming time series data into images, the model captures complex temporal relationships that might be missed by traditional sequence models.

# 2. **Feature Learning**: The DCAE component automatically learns discriminative features, reducing the need for manual feature engineering.

# 3. **High Performance Metrics**: As documented in research, this approach achieves exceptional performance metrics:
#    - AUC: 0.891 ± 0.026 (test set)
#    - Sensitivity: 0.816 ± 0.039
#    - Specificity: 0.845 ± 0.022
#    - Gmean: 0.830 ± 0.027

# 4. **Handling Multiple Sequence Modes**: The implementation is flexible enough to work with your different sequence preprocessing modes (set_window, dtw, pad, or variable_length).

# The main trade-off is increased computational complexity compared to simpler models, but this is justified by the significant performance improvements.

# ## Conclusion

# This implementation allows you to integrate the state-of-the-art GASF-DCAE-DNN model into your existing experimental framework. The model transforms time series data into image representations, extracts features using a deep convolutional auto-encoder, and makes predictions using a deep neural network.

# By incorporating this advanced approach alongside your existing models, you can compare its performance in predicting sports injuries and potentially achieve the superior results documented in the literature.

# ---
# Answer from Perplexity: pplx.ai/share