Certainly! Let's comprehensively align each preprocessing step with specific model types, elucidating the rationale behind the chosen strategies. This structured approach ensures that the data preparation is optimally tailored to the nuances and requirements of each machine learning algorithm in your pipeline.

---

## **Step-by-Step Alignment and Reasoning**

### **1. Handle Missing Values**

**Goal:** Ensure that the dataset is free from missing values to prevent errors during model training and to maintain data integrity.

**Model-Specific Strategies and Reasoning:**

- **Linear Regression & Logistic Regression:**
  - **Numerical Features:** **Mean Imputation** is a common default because it leaves each feature's mean unchanged, which keeps the fitted intercept and coefficients stable. It does shrink the feature's variance and is pulled toward outliers, so prefer the median when a feature is heavily skewed.
  - **Categorical Features:** **Mode Imputation** fills gaps with the most frequent category, a simple default that adds no new level to encode.

- **Tree-Based Models (e.g., Decision Trees, Random Forests):**
  - **Numerical Features:** **Median Imputation** is robust against outliers, preserving the central tendency without skewing the data.
  - **Categorical Features:** **Constant ("Missing") Label** is used to explicitly indicate missingness, allowing tree-based models to handle missing categories effectively without introducing bias.

- **Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), Neural Networks (NN):**
  - **Numerical Features:** **Median Imputation** helps in maintaining the scale and distribution, crucial for distance-based algorithms like SVM and k-NN.
  - **Categorical Features:** **Mode Imputation** ensures consistency in categorical representation, which is vital for neural networks that require fixed input sizes.

- **Clustering Algorithms:**
  - **Numerical Features:** **Median Imputation** aids in maintaining cluster integrity by preventing distortion from outliers.
  - **Categorical Features:** **Mode Imputation** or **Constant Label** depending on the clustering algorithm's sensitivity to missing categories.

- **Time Series Models:**
  - **Numerical Features:** **Interpolation** is essential to maintain temporal continuity and trend, ensuring that the time-dependent patterns are preserved.
  - **Categorical Features:** **Mode Imputation** maintains consistency in categorical states over time.

**Rationale:**

Different models have varying sensitivities to missing data and the impact of imputation strategies. Aligning imputation methods with model-specific requirements ensures that the preprocessing does not inadvertently bias the model or degrade its performance.
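
As a rough illustration of the numerical strategies above, the sketch below compares mean and median imputation on a small made-up column that contains one extreme value. The data and variable names are hypothetical; only `SimpleImputer` comes from the pipeline itself.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column: four typical values, one outlier, one missing entry.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(values)
median_filled = SimpleImputer(strategy="median").fit_transform(values)

# The mean is pulled toward the outlier; the median is not.
print(mean_filled[-1, 0])    # 22.0
print(median_filled[-1, 0])  # 3.0
```

The gap between the two fills is why the median is the safer choice whenever a feature has a long tail.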

---

### **2. Test for Normality**

**Goal:** Assess whether numerical feature distributions conform to normality, influencing subsequent transformation decisions to optimize model performance.

**Model-Specific Strategies and Reasoning:**

- **Linear Regression & Logistic Regression:**
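  - **Importance of Normality:** Linear regression's statistical inference (confidence intervals, p-values) assumes approximately normal residuals, not normal features; logistic regression makes no normality assumption at all. Heavily skewed features can still create high-leverage points and unstable coefficients in both models.
  - **Action:** If features are strongly skewed, apply transformations (e.g., Log, Box-Cox, Yeo-Johnson) to reduce skewness, and for linear regression check the residuals rather than the features when judging the normality assumption.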

- **Neural Networks, SVM, k-NN, Clustering:**
  - **Importance of Normality:** While not strictly necessary, normalizing skewed data can stabilize training processes and improve convergence.
  - **Action:** Apply transformations if skewness is severe (e.g., |skewness| > 1.0) to enhance feature scaling and model robustness.

- **Tree-Based Models:**
  - **Importance of Normality:** These models are largely insensitive to feature distribution: splits depend only on the ordering of values, so a monotonic transformation leaves the chosen partitions essentially unchanged.
  - **Action:** Skip transformations unless features exhibit extreme skewness (e.g., |skewness| > 2.0), in which case a transformation like Yeo-Johnson can be considered, although it rarely changes the fitted trees.

- **Time Series Models:**
  - **Importance of Normality:** Focuses more on stationarity and stable variance rather than strict normality.
  - **Action:** Apply transformations such as logarithmic scaling to stabilize variance and trends, facilitating better model performance.

**Rationale:**

Understanding the distribution of features aids in selecting appropriate transformations that align with model assumptions, thereby enhancing model efficacy and interpretability.
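
The checks described above can be sketched as follows. This is a minimal example on synthetic data; the helper name `describe` and its thresholds are assumptions chosen to mirror the |skewness| > 1.0 rule of thumb, not part of any library.

```python
import numpy as np
from scipy.stats import shapiro, skew

rng = np.random.default_rng(0)
symmetric = rng.normal(loc=0.0, scale=1.0, size=500)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

def describe(sample, skew_threshold=1.0, alpha=0.05):
    """Return sample skewness, the Shapiro-Wilk p-value, and two simple flags."""
    s = float(skew(sample))
    _, p = shapiro(sample)
    return {
        "skewness": s,
        "p_value": float(p),
        "severely_skewed": bool(abs(s) > skew_threshold),
        "rejects_normality": bool(p < alpha),
    }

print(describe(symmetric))
print(describe(right_skewed))
```

On large samples a normality test rejects even for trivial deviations, so the skewness magnitude is usually the more practical trigger for a transformation.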

---

### **3. Handle Outliers**

**Goal:** Mitigate the influence of extreme values that can distort model training, especially for sensitive algorithms.

**Model-Specific Strategies and Reasoning:**

- **Linear Regression & Logistic Regression:**
  - **Sensitivity to Outliers:** High; outliers can significantly skew the model's parameters.
  - **Action:** Implement **Z-Score Filtering** to remove data points beyond 3 standard deviations, followed by **IQR Filtering** to exclude points outside 1.5 * IQR. These methods ensure that the model is trained on representative data without extreme distortions.

- **Support Vector Machines (SVM) & k-Nearest Neighbors (k-NN):**
  - **Sensitivity to Outliers:** High; outliers can affect distance calculations and decision boundaries.
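  - **Action:** Utilize **IQR Filtering** or **Winsorization** to limit the impact of outliers: filtering drops rows outside the 1.5 * IQR fences, while winsorization keeps the rows and caps their values at the fences. Applying both with the same bounds is redundant, since nothing is left to cap after filtering.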

- **Neural Networks (NN):**
  - **Sensitivity to Outliers:** Moderate to high; while NNs can handle some variability, extreme values can impede learning.
  - **Action:** Apply **Winsorization** or **Clipping** to restrict feature ranges, ensuring stable gradient updates and convergence.

- **Clustering Algorithms:**
  - **Sensitivity to Outliers:** Moderate; outliers can form their own clusters or distort existing ones.
  - **Action:** Implement **Isolation Forest** to detect and exclude anomalous data points, maintaining the quality of cluster formations.

- **Tree-Based Models:**
  - **Sensitivity to Outliers:** Low; these models are robust to outliers due to their splitting criteria.
  - **Action:** Generally, no action is required unless outliers are extreme enough to cause computational issues, in which case minimal handling (e.g., Winsorization) can be performed.

- **Time Series Models:**
  - **Sensitivity to Outliers:** High; outliers can disrupt trend analysis and forecasting.
  - **Action:** Apply **Rolling Statistics (Smoothing)** to mitigate sudden spikes and **Winsorization** to cap extreme values, ensuring smoother temporal patterns.

**Rationale:**

Tailoring outlier handling methods to each model's sensitivity ensures that preprocessing enhances model performance without introducing bias or data distortion.
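
To make the difference between filtering and capping concrete, here is a small sketch using Tukey's 1.5 * IQR fences. The helper `iqr_bounds` and the sample series are hypothetical.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the lower and upper Tukey fences for a numeric series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
lower, upper = iqr_bounds(s)

filtered = s[(s >= lower) & (s <= upper)]  # IQR filtering: the outlier row is dropped
capped = s.clip(lower, upper)              # Winsorization: the row stays, the value is capped

print(lower, upper)              # 8.75 14.75
print(len(filtered), len(capped))  # 6 7
print(capped.max())              # 14.75
```

Filtering changes the number of rows, so the same row mask has to be applied to the target; capping keeps features and target aligned.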

---

### **4. Split Dataset into Train/Test and X/y**

**Goal:** Divide the dataset into training and testing subsets to evaluate model performance on unseen data, preventing data leakage and ensuring unbiased assessments.

**Model-Specific Strategies and Reasoning:**

- **All Models (Classification & Regression):**
  - **Stratified Split for Classification:** Ensures that both training and testing sets maintain similar class distributions, crucial for balanced model evaluation.
  - **Random Split for Regression:** Stratification isn't applicable; a simple random split suffices to maintain data distribution.

- **Considerations for Time Series Models:**
  - **Temporal Split:** Instead of random splitting, use a chronological split to prevent future data from leaking into the training set, maintaining temporal integrity.

**Action Points:**

- **For Classification Tasks:**
  - **Stratified Splitting:** Utilize stratification to preserve class proportions across training and testing sets.
  
- **For Regression Tasks:**
  - **Random Splitting:** Ensure random selection to maintain feature distribution without relying on stratification.

- **For Time Series Tasks:**
  - **Sequential Splitting:** Split based on time to respect the temporal order, crucial for accurate forecasting.

**Rationale:**

Proper dataset splitting ensures that models are trained and evaluated on representative and unbiased subsets, fostering reliable performance metrics.
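
The two splitting styles can be sketched as below, assuming a toy imbalanced label vector and, for the chronological case, rows that are already sorted by time. All array names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # imbalanced labels: 10% positives

# Stratified split: both subsets keep the 90/10 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_tr.sum(), y_te.sum())  # 8 2

# Chronological split: assumes rows are sorted by time, oldest first.
cut = int(len(X) * 0.8)
X_past, X_future = X[:cut], X[cut:]
print(X_past.max() < X_future.min())  # True
```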

---

### **5. Choose and Apply Transformations (Based on Normality Tests)**

**Goal:** Modify feature distributions to align with model assumptions or to enhance model performance.

**Model-Specific Strategies and Reasoning:**

- **Linear Regression & Logistic Regression:**
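  - **Necessity of Normality:** Moderate; neither model requires normally distributed features, but reducing skewness often makes relationships more linear and, for linear regression, brings residuals closer to normality.
  - **Action:** Apply **Log Transformation** or **Box-Cox** (both require strictly positive values) or **Yeo-Johnson** (which also accepts zero and negative values) to skewed features, which helps residuals approximate normality.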

- **Neural Networks, SVM, k-NN, Clustering:**
  - **Necessity of Normality:** Moderate; transformations can stabilize training and improve convergence, especially for highly skewed data.
  - **Action:** Apply transformations like **Yeo-Johnson** for moderately skewed features or **Log Transformation** for highly skewed features to enhance feature scaling and model robustness.

- **Tree-Based Models:**
  - **Necessity of Normality:** Low; tree splits depend on the ordering of values, not on the shape of the distribution.
  - **Action:** Skip transformations unless features exhibit extreme skewness (e.g., |skewness| > 2.0), in which case a minimal transformation like **Yeo-Johnson** is optional and rarely changes the fitted trees.

- **Time Series Models:**
  - **Necessity of Normality:** Moderate; focus on stabilizing variance and achieving stationarity.
  - **Action:** Apply transformations such as **Log Transformation** or **Differencing** to stabilize variance and remove trends, facilitating better forecasting accuracy.

**Rationale:**

Aligning feature distributions with model expectations or enhancing data properties through transformations can significantly improve model performance, convergence, and interpretability.
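
A minimal sketch of a Yeo-Johnson transformation with scikit-learn's `PowerTransformer`, fitted on training data only and reused on test data. The synthetic lognormal features are an assumption made for illustration.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_train = rng.lognormal(size=(400, 1))  # strongly right-skewed
X_test = rng.lognormal(size=(100, 1))

# Learn the power parameter on the training set only, then reuse it.
pt = PowerTransformer(method="yeo-johnson")
X_train_t = pt.fit_transform(X_train)
X_test_t = pt.transform(X_test)

skew_before = float(skew(X_train[:, 0]))
skew_after = float(skew(X_train_t[:, 0]))
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Keeping the fitted `pt` object around also makes the later inverse transformation straightforward.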

---

### **6. Encode Categorical Variables**

**Goal:** Convert categorical data into numerical formats to enable model ingestion, while preserving the inherent relationships and structures within the data.

**Model-Specific Strategies and Reasoning:**

- **Linear Regression & Logistic Regression:**
  - **Ordinal Features:** Use **Ordinal Encoding** if categories have a meaningful order.
  - **Nominal Features:** Apply **One-Hot Encoding** to prevent the model from assuming ordinal relationships where none exist.

- **Neural Networks, SVM, k-NN, Clustering:**
  - **Ordinal Features:** Use **Ordinal Encoding** to retain order information.
  - **Nominal Features:** Utilize **One-Hot Encoding** to facilitate distance-based computations and prevent the introduction of artificial ordinality.

- **Tree-Based Models:**
  - **Ordinal Features:** Use **Ordinal Encoding**, as trees can effectively utilize ordered categorical information.
  - **Nominal Features:** **Ordinal Encoding** is sufficient, given trees' ability to handle categorical splits without needing one-hot representations.

**Action Points:**

- **For Nominal Features in Distance-Based Models (e.g., SVM, k-NN):**
  - **One-Hot Encoding:** Enhances feature space representation, allowing the model to compute distances accurately without assuming ordinality.

- **For Nominal Features in Tree-Based Models:**
  - **Ordinal Encoding:** Simplifies the feature space and leverages trees' inherent ability to handle categorical splits without redundancy.

**Rationale:**

Selecting appropriate encoding methods based on feature types and model sensitivities ensures that categorical information is represented accurately, preventing model misinterpretation and enhancing predictive performance.
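
The two encoders can be compared on a tiny hypothetical frame. The column names and the category order passed to `OrdinalEncoder` are assumptions for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "size": ["small", "large", "medium"],  # ordinal: small < medium < large
    "color": ["red", "blue", "red"],       # nominal: no inherent order
})

# Ordinal encoding with an explicit category order.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(df[["size"]])
print(size_codes.ravel().tolist())  # [0.0, 2.0, 1.0]

# One-hot encoding; categories are sorted alphabetically (blue, red).
onehot = OneHotEncoder(handle_unknown="ignore")
color_matrix = onehot.fit_transform(df[["color"]]).toarray()
print(color_matrix.tolist())  # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Passing the order explicitly matters: without it, `OrdinalEncoder` sorts categories alphabetically, which would place "large" before "medium".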

---

### **7. Apply Scaling (If Needed by Model)**

**Goal:** Normalize feature scales to ensure that models relying on feature magnitudes or distances operate effectively.

**Model-Specific Strategies and Reasoning:**

- **Linear Regression & Logistic Regression:**
  - **Scaling Method:** **StandardScaler** is preferred to center features around zero with unit variance, facilitating gradient-based optimization and improving model convergence.

- **Neural Networks:**
  - **Scaling Method:** **StandardScaler** helps in stabilizing learning by ensuring that input features contribute equally, preventing dominant gradients.

- **Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), Clustering Algorithms:**
  - **Scaling Method:** **MinMaxScaler** scales features to a specific range (e.g., [0,1]), which is crucial for distance-based computations and ensuring that all features contribute proportionately to distance calculations.

- **Tree-Based Models:**
  - **Scaling Necessity:** **Not Required.** Trees are insensitive to feature scales due to their splitting criteria based on order and thresholds, making scaling unnecessary and computationally redundant.

- **Time Series Models:**
  - **Scaling Necessity:** **Optional.** Depending on the specific model and its sensitivity to feature scales, scaling may be applied to stabilize variance or improve computational efficiency.

**Rationale:**

Appropriate scaling enhances model performance by ensuring that features contribute appropriately to the learning process, particularly for algorithms sensitive to feature magnitudes and distances.
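
A short sketch of the two scalers, each fitted on training data and then applied to a test row. The numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 150.0]])

std = StandardScaler().fit(X_train)
mm = MinMaxScaler().fit(X_train)

X_train_std = std.transform(X_train)  # each column: mean 0, unit variance
X_test_mm = mm.transform(X_test)      # scaled with the training min and max

print(X_train_std.mean(axis=0))  # approximately [0. 0.]
print(X_test_mm)                 # [[1.5  0.25]]
```

The first test value maps to 1.5 because it exceeds the training maximum; values outside [0, 1] on unseen data are expected with `MinMaxScaler` and are not an error.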

---

### **8. Implement SMOTE (Train Only)**

**Goal:** Address class imbalance in classification tasks to prevent models from being biased towards majority classes, enhancing predictive performance on minority classes.

**Model-Specific Strategies and Reasoning:**

- **All Classification Models (Logistic Regression, Neural Networks, SVM, k-NN, Tree-Based Models):**
  - **Presence of Categorical Features:**
    - **SMOTENC:** Utilized when both numerical and categorical features are present, preserving the categorical structure during synthetic sample generation.
  - **All Numerical Features:**
    - **SMOTE:** Applied to purely numerical datasets to generate synthetic minority class samples, balancing class distributions.
  - **All Categorical Features:**
    - **SMOTEN:** Applied to purely categorical datasets to generate synthetic minority class samples, balancing class distributions.

- **Non-Classification Models (e.g., Regression, Clustering, Time Series):**
  - **Action:** **Do Not Apply SMOTE.** SMOTE needs class labels to identify the minority class; applying it to regression, unsupervised clustering, or time series models is inappropriate and can distort the data.

**Action Points:**

- **Apply SMOTE Only to Training Data:** Ensures that the synthetic balancing does not influence the evaluation on unseen test data, maintaining the integrity of performance metrics.

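- **Order of Operations:** Fit encoders and scalers on the training data before SMOTE so that neighbor distances are computed on comparable scales. For SMOTENC, pass the indices of the categorical columns so that they are resampled as categories rather than interpolated like numbers.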

**Rationale:**

Balancing class distributions through SMOTE enhances the model's ability to learn from minority classes, preventing bias towards majority classes and improving overall classification performance.
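
To show what "synthetic sample generation" means, here is a simplified NumPy-only sketch of the interpolation idea behind SMOTE. It is not the imbalanced-learn implementation, and the function `smote_like_oversample` is hypothetical; in the pipeline itself one would normally call `fit_resample` on `SMOTE`, `SMOTENC`, or `SMOTEN` from `imblearn.over_sampling`, on training data only.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours per row
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class rows taken from the training split only.
X_train_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_train_minority, n_new=6)
print(synthetic.shape)  # (6, 2)
```

Each synthetic row lies on the line segment between a real minority sample and one of its neighbours, which is why plain SMOTE only makes sense for numerical features.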

---

### **9. Train Model on Preprocessed Training Data**

**Goal:** Fit the selected machine learning model on the meticulously preprocessed and balanced training dataset to learn underlying patterns and make accurate predictions.

**Model-Specific Strategies and Reasoning:**

- **Linear Regression:**
  - **Focus:** Learning linear relationships between features and the target variable.
  - **Action:** Utilize the preprocessed data to train the model, ensuring that feature scales and distributions align with model assumptions for optimal parameter estimation.

- **Logistic Regression:**
  - **Focus:** Binary or multi-class classification based on linear decision boundaries.
  - **Action:** Train on preprocessed, balanced data to accurately estimate the probability of class memberships, leveraging imputed, encoded, and scaled features.

- **Neural Networks:**
  - **Focus:** Capturing complex, non-linear relationships through multiple layers and neurons.
  - **Action:** Train on scaled and encoded data to ensure stable gradient updates and effective learning of intricate patterns.

- **Support Vector Machines (SVM):**
  - **Focus:** Finding optimal hyperplanes that maximize class separation.
  - **Action:** Train on scaled and encoded data to facilitate accurate distance computations and hyperplane placement.

- **k-Nearest Neighbors (k-NN):**
  - **Focus:** Classifying based on the proximity of data points in feature space.
  - **Action:** Train on scaled and encoded data to ensure that distance metrics accurately reflect feature similarities.

- **Clustering Algorithms (e.g., K-Means):**
  - **Focus:** Grouping similar data points based on feature similarity.
  - **Action:** Train on scaled and encoded data to ensure that cluster assignments are based on meaningful feature proximities.

- **Tree-Based Models (e.g., Decision Trees, Random Forests):**
  - **Focus:** Hierarchical splitting based on feature thresholds to capture non-linear relationships.
  - **Action:** Train on encoded data without the necessity for scaling, leveraging the model's robustness to feature scales and distributions.

**Rationale:**

Model training leverages the preprocessed data to ensure that each algorithm's strengths are maximized, while also adhering to its specific requirements regarding feature representation and scaling.

---

### **10. Predict on Test Data (No SMOTE on Test)**

**Goal:** Evaluate model performance on unseen data without introducing synthetic samples, ensuring that performance metrics reflect real-world applicability.

**Model-Specific Strategies and Reasoning:**

- **All Models:**
  - **Consistent Preprocessing:** Ensure that test data undergoes the same encoding and scaling transformations as the training data to maintain consistency in feature representation.
  - **Avoid SMOTE:** Never resample the test set. It must keep its natural class distribution so that the metrics give an accurate assessment of model performance.

**Action Points:**

- **Apply Fitted Encoders and Scalers:** Use transformers fitted on the training data to transform the test set, maintaining feature consistency.
  
- **Maintain Original Class Distribution:** Do not alter the test set's class distribution to ensure that performance metrics (e.g., accuracy, precision, recall) are unbiased and reflective of real-world scenarios.

**Rationale:**

Evaluating models on natural, unaltered test data provides an authentic measure of their predictive capabilities, ensuring that performance assessments are reliable and applicable to real-world data distributions.
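
One way to keep training and test preprocessing consistent is to wrap the transformers and the model in a single scikit-learn `Pipeline`, sketched below on synthetic data. The step names `"scale"` and `"clf"` and the generated dataset are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic, linearly separable labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# fit() learns the scaler statistics from the training rows only;
# score() reuses those statistics to transform the untouched test rows.
model = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

If resampling is part of the workflow, imbalanced-learn provides its own `Pipeline` class that applies samplers during fitting only.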

---

### **11. Final Inverse Transformations for Interpretability**

**Goal:** Revert preprocessed data back to its original form for interpretability, reporting, and actionable insights, facilitating better understanding and communication of model results.

**Model-Specific Strategies and Reasoning:**

- **All Models:**
  - **Inverse Scaling:** Use the stored scaler parameters (e.g., means and scales from `StandardScaler` or `MinMaxScaler`) to revert numerical features to their original scales, enhancing interpretability of feature impacts.
  
  - **Inverse Encoding:**
    - **Ordinal Features:** Utilize `OrdinalEncoder`'s inverse transform to retrieve original categorical labels, maintaining the integrity of ordinal relationships.
    - **Nominal Features with One-Hot Encoding:** Use `OneHotEncoder`'s inverse transform to reconstruct original categorical variables from their encoded representations.
    - **Nominal Features with Ordinal Encoding (Tree-Based Models):** Apply `OrdinalEncoder`'s inverse transform to retrieve original labels, ensuring that encoded representations are accurately mapped back.

- **Special Considerations for Transformations:**
  - **Log and Yeo-Johnson Transformations:** Apply mathematical inverses (e.g., exponential for log transforms) to revert feature values to their original distributions, aiding in result interpretation.

**Action Points:**

- **Maintain Transformer Objects:** Ensure that all transformers (encoders, scalers, power transformers) are stored post-training to facilitate accurate inverse transformations.
  
- **Handle One-Hot Encoded Features:** Carefully map one-hot encoded columns back to their original categorical labels, ensuring that the reconstructed data aligns with original feature structures.

**Rationale:**

Inverse transformations bridge the gap between model computations and real-world interpretations, enabling stakeholders to understand feature influences and model predictions in familiar terms.
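
The inverse operations listed above can be sketched with fitted scikit-learn transformers. The toy columns are hypothetical, and the `log1p`/`expm1` pair stands in for a manually applied log transform.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical training columns.
income = np.array([[20.0], [40.0], [60.0]])
size = np.array([["small"], ["large"], ["medium"]], dtype=object)
color = np.array([["red"], ["blue"], ["red"]], dtype=object)

scaler = StandardScaler().fit(income)
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit(size)
onehot = OneHotEncoder().fit(color)

# Each fitted transformer can undo its own transformation.
income_back = scaler.inverse_transform(scaler.transform(income))
size_back = ordinal.inverse_transform(ordinal.transform(size))
color_back = onehot.inverse_transform(onehot.transform(color))

# The same inverse scaling written out with the stored parameters.
manual_back = scaler.transform(income) * scaler.scale_ + scaler.mean_

# A log1p transform is undone with expm1.
log_back = np.expm1(np.log1p(income))

print(size_back.ravel().tolist())   # ['small', 'large', 'medium']
print(color_back.ravel().tolist())  # ['red', 'blue', 'red']
```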

---

### **12. Final Inverse Transformation Validation**

**Goal:** Confirm that inverse transformations accurately restore data to its original form within acceptable tolerances, ensuring the reliability of interpretability efforts.

**Model-Specific Strategies and Reasoning:**

- **All Models:**
  - **Numerical Features:**
    - **Tolerance Levels:** Allow for minor discrepancies due to floating-point precision (e.g., Mean Absolute Error < 1e-4).
    - **Validation Metrics:** Utilize statistical measures such as Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE) to quantify differences between original and inverse-transformed data.
  
  - **Categorical Features:**
    - **Exact Matches:** Ensure that inverse-transformed categories exactly match the original labels, as discrepancies can indicate encoding issues.
  
  - **One-Hot Encoded Features:**
    - **Consistent Mapping:** Validate that each one-hot encoded vector correctly maps back to its original categorical label without loss of information.

**Action Points:**

- **Automate Validation Checks:** Implement automated assertions or checks that flag features exceeding predefined difference thresholds.

- **Visual Comparisons:** Plot original vs. inverse-transformed feature distributions to visually assess alignment.

- **Detailed Logging:** Record discrepancies and their magnitudes to facilitate troubleshooting and pipeline refinement.

**Rationale:**

Validating inverse transformations ensures that interpretability remains faithful to the original data, preventing misleading conclusions and maintaining the integrity of data-driven insights.
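
A possible shape for the automated checks is sketched below. The helper names `numeric_round_trip_mae` and `labels_match` are hypothetical, and the 1e-4 tolerance simply mirrors the example threshold mentioned above.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def numeric_round_trip_mae(original, restored):
    """Mean absolute error between original and inverse-transformed values."""
    original = np.asarray(original, dtype=float)
    restored = np.asarray(restored, dtype=float)
    return float(np.mean(np.abs(original - restored)))

def labels_match(original, restored):
    """Categorical features must be restored exactly."""
    return bool(np.array_equal(np.asarray(original), np.asarray(restored)))

rng = np.random.default_rng(0)
X = rng.lognormal(size=(200, 1))  # synthetic skewed feature

pt = PowerTransformer(method="yeo-johnson").fit(X)
X_back = pt.inverse_transform(pt.transform(X))

mae = numeric_round_trip_mae(X, X_back)
if mae >= 1e-4:
    raise ValueError(f"Round-trip error too large: {mae:.2e}")

print(labels_match(["a", "b", "a"], ["a", "b", "a"]))  # True
print(labels_match(["a", "b", "a"], ["a", "b", "c"]))  # False
```

Raising an error rather than only logging makes a broken inverse transformation fail fast in automated runs.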

---

## **Conclusion**

Aligning each preprocessing step with specific model types, grounded in clear reasoning, ensures that your data preparation pipeline is both robust and optimized for diverse machine learning algorithms. This meticulous approach enhances model performance, interpretability, and reliability, fostering a seamless transition from raw data to actionable insights.

By adhering to these model-specific strategies, you ensure that each preprocessing decision is intentional and contributes meaningfully to the overarching goals of your machine learning workflow. Should you need further elaboration on any step or require assistance with additional pipeline components, feel free to reach out!

In [None]:
# data_preprocessor.py

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer, OrdinalEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import BorderlineSMOTE, ADASYN, SMOTE, SMOTENC
from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn.ensemble import IsolationForest
from scipy.stats import shapiro, anderson
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import logging
from typing import List, Optional, Dict, Tuple


class DataPreprocessor:
    def __init__(
        self, 
        model_type: str, 
        column_assets: Dict[str, List[str]], 
        perform_split: bool = True, 
        debug: bool = False
    ):
        """
        Initialize the DataPreprocessor.

        Args:
            model_type (str): Type of the machine learning model.
            column_assets (dict): Dictionary containing lists of ordinal, nominal, and numerical columns.
            perform_split (bool): Whether to perform train-test split and apply SMOTE.
            debug (bool): Flag to enable detailed debugging information.
        """
        self.model_type = model_type
        self.ordinal_categoricals = column_assets.get('ordinal_categoricals', [])
        self.nominal_categoricals = column_assets.get('nominal_categoricals', [])
        self.numericals = column_assets.get('numericals', [])
        self.y_variable = column_assets.get('y_variable', '')
        self.perform_split = perform_split
        self.debug = debug

        # Containers for transformers
        self.preprocessor = None
        self.pipeline = None
        self.smote = None

        # Initialize transformer to None
        self.transformer = None

        # To keep track of preprocessing steps and reasons
        self.preprocessing_steps = []
        self.feature_reasons = {col: '' for col in self.ordinal_categoricals + self.nominal_categoricals + self.numericals}

        # Configure logging
        self.logger = logging.getLogger(self.__class__.__name__)
        self.logger.setLevel(logging.DEBUG if self.debug else logging.INFO)
        handler = logging.StreamHandler()
        formatter = logging.Formatter('%(asctime)s [%(levelname)s] %(message)s')
        handler.setFormatter(formatter)
        if not self.logger.handlers:
            self.logger.addHandler(handler)

        # Store encoders for inverse transformation
        self.ordinal_encoder = None
        self.nominal_encoder = None

    def handle_missing_values(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Handle missing values for numerical and categorical features.

        Args:
            X (pd.DataFrame): Input features.

        Returns:
            pd.DataFrame: Features with imputed missing values.
        """
        step_name = "Handling Missing Values"
        self.logger.info(f"Step: {step_name}")

        # Numerical Imputation
        if self.numericals:
            if self.model_type in ['Linear Regression', 'Logistic Regression']:
                strategy = 'mean'
            elif self.model_type in ['Time Series Models']:
                strategy = 'interpolate'  # handled with pandas below, not SimpleImputer
            else:
                strategy = 'median'
            self.logger.debug(f"Numerical Imputation Strategy: {strategy.capitalize()}")

            if strategy == 'interpolate':
                # Linear interpolation preserves temporal continuity; bfill/ffill cover
                # any gaps interpolation leaves at the start or end of the series.
                X[self.numericals] = X[self.numericals].interpolate(method='linear').bfill().ffill()
                for col in self.numericals:
                    self.feature_reasons[col] += 'Numerical: Interpolation Imputation | '
            else:
                numerical_imputer = SimpleImputer(strategy=strategy)
                X[self.numericals] = numerical_imputer.fit_transform(X[self.numericals])
                self.numerical_imputer = numerical_imputer
                for col in self.numericals:
                    self.feature_reasons[col] += f'Numerical: {strategy.capitalize()} Imputation | '

        # Categorical Imputation
        all_categoricals = self.ordinal_categoricals + self.nominal_categoricals
        if all_categoricals:
            if self.model_type in ['Tree-Based Models']:
                strategy = 'constant'
                fill_value = 'Missing'
                self.logger.debug(f"Categorical Imputation Strategy: {strategy.capitalize()} with fill value '{fill_value}'")
                categorical_imputer = SimpleImputer(strategy=strategy, fill_value=fill_value)
                X[all_categoricals] = categorical_imputer.fit_transform(X[all_categoricals])
                self.categorical_imputer = categorical_imputer
                for col in all_categoricals:
                    self.feature_reasons[col] += f'Categorical: Constant Imputation | '
            else:
                strategy = 'most_frequent'
                self.logger.debug(f"Categorical Imputation Strategy: {strategy.capitalize()}")
                categorical_imputer = SimpleImputer(strategy=strategy)
                X[all_categoricals] = categorical_imputer.fit_transform(X[all_categoricals])
                self.categorical_imputer = categorical_imputer
                for col in all_categoricals:
                    self.feature_reasons[col] += f'Categorical: Mode Imputation | '

        self.preprocessing_steps.append(step_name)
        self.logger.info(f"Completed: {step_name}. Dataset shape: {X.shape}")

        if self.debug:
            self.logger.debug(f"DataFrame shape after {step_name}: {X.shape}")
            self.logger.debug(f"Columns after {step_name}: {all_categoricals + self.numericals}")
            for col in all_categoricals + self.numericals:
                self.logger.debug(f"Column '{col}' - Data Type: {X[col].dtype}, Sample Values: {X[col].dropna().unique()[:5]}")

        return X

    def test_normality(self, X: pd.DataFrame) -> Dict[str, Dict]:
        """
        Test normality for numerical features.

        Args:
            X (pd.DataFrame): Input features.

        Returns:
            dict: Normality test results per feature.
        """
        step_name = "Testing for Normality"
        self.logger.info(f"Step: {step_name}")
        normality_results = {}
        for col in self.numericals:
            data = X[col].dropna()
            skewness = data.skew()
            kurtosis = data.kurtosis()
            n = data.shape[0]

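            # Choose test based on dataset size
            if n <= 5000:
                stat, p_value = shapiro(data)
                test_used = 'Shapiro-Wilk'
            else:
                # Anderson-Darling returns critical values rather than a p-value, so record
                # the largest tabulated significance level at which normality is not rejected.
                result = anderson(data)
                stat = result.statistic
                test_used = 'Anderson-Darling'
                p_value = 0.0  # Assume not normal if no critical value is higher than stat

                for cv, sig in zip(result.critical_values, result.significance_level):
                    if stat < cv:
                        p_value = sig / 100
                        break

            # >= so that "not rejected at the 5% level" (p_value == 0.05) counts as normal
            is_normal = p_value >= 0.05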
            normality_results[col] = {
                'skewness': skewness,
                'kurtosis': kurtosis,
                'p_value': p_value,
                'test_used': test_used,
                'is_normal': is_normal
            }
            self.logger.info(f"Normality Test for '{col}': p-value={p_value:.4f}, is_normal={is_normal}")
            self.logger.debug(f"Skewness: {skewness:.4f}, Kurtosis: {kurtosis:.4f}")

        self.normality_results = normality_results
        self.preprocessing_steps.append(step_name)
        self.logger.info(f"Completed: {step_name}. Dataset shape: {X.shape}")

        if self.debug:
            for col, stats in normality_results.items():
                self.logger.debug(
                    f"Feature '{col}': Skewness={stats['skewness']:.4f}, Kurtosis={stats['kurtosis']:.4f}, "
                    f"P-Value={stats['p_value']:.4f}, Test Used={stats['test_used']}, Is Normal={stats['is_normal']}"
                )

        return normality_results

    def handle_outliers(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Handle outliers based on the model's sensitivity.

        Args:
            X (pd.DataFrame): Input features.

        Returns:
            pd.DataFrame: Features with outliers handled.
        """
        step_name = "Handling Outliers"
        self.logger.info(f"Step: {step_name}")
        for col in self.numericals:
            initial_shape = X.shape[0]
            if self.model_type in ['Logistic Regression', 'Linear Regression']:
                # Z-Score Method
                z_scores = np.abs((X[col] - X[col].mean()) / X[col].std())
                X = X[z_scores < 3]
                self.feature_reasons[col] += 'Outliers handled with Z-Score | '
                self.logger.debug(f"Removed {initial_shape - X.shape[0]} outliers from '{col}' using Z-Score")
                initial_shape = X.shape[0]

                # Tukey's IQR Method
                Q1 = X[col].quantile(0.25)
                Q3 = X[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                before_iqr_shape = X.shape[0]
                X = X[(X[col] >= lower_bound) & (X[col] <= upper_bound)]
                after_iqr_shape = X.shape[0]
                self.feature_reasons[col] += 'Outliers handled with IQR | '
                self.logger.debug(f"Removed {before_iqr_shape - after_iqr_shape} outliers from '{col}' using IQR")

            elif self.model_type in ['SVM', 'k-NN']:
                # Tukey's IQR Method
                Q1 = X[col].quantile(0.25)
                Q3 = X[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                before_iqr_shape = X.shape[0]
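                # .copy() avoids pandas' SettingWithCopyWarning when the column is reassigned below
                X = X[(X[col] >= lower_bound) & (X[col] <= upper_bound)].copy()
                after_iqr_shape = X.shape[0]
                self.feature_reasons[col] += 'Outliers handled with IQR | '
                self.logger.debug(f"Removed {before_iqr_shape - after_iqr_shape} outliers from '{col}' using IQR")

                # Winsorization (a safeguard only: rows outside these bounds were already removed above)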
                X[col] = X[col].clip(lower_bound, upper_bound)
                self.feature_reasons[col] += 'Outliers handled with Winsorization | '
                self.logger.debug(f"Winsorized '{col}' to bounds ({lower_bound}, {upper_bound})")

            elif self.model_type == 'Neural Networks':
                # Winsorization
                Q1 = X[col].quantile(0.25)
                Q3 = X[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                X[col] = X[col].clip(lower_bound, upper_bound)
                self.feature_reasons[col] += 'Outliers handled with Winsorization | '
                self.logger.debug(f"Winsorized '{col}' to bounds ({lower_bound}, {upper_bound})")

            elif self.model_type == 'Clustering':
                # Isolation Forest
                iso = IsolationForest(contamination=0.05, random_state=42)
                preds = iso.fit_predict(X[[col]])
                before_iso_shape = X.shape[0]
                X = X[preds == 1]
                after_iso_shape = X.shape[0]
                self.feature_reasons[col] += 'Outliers handled with Isolation Forest | '
                self.logger.debug(f"Removed {before_iso_shape - after_iso_shape} outliers from '{col}' using Isolation Forest")

            elif self.model_type == 'Tree-Based Models':
                # Tree-Based Models are robust to outliers; optional handling
                self.logger.debug(f"No outlier handling for '{col}' as Tree-Based Models are robust to outliers.")

            elif self.model_type == 'Time Series Models':
                # Rolling Statistics (Smoothing)
                X[col] = X[col].rolling(window=3, min_periods=1).mean()
                self.feature_reasons[col] += 'Applied Rolling Statistics (Smoothing) | '
                self.logger.debug(f"Applied Rolling Statistics to '{col}'")

                # Winsorization
                Q1 = X[col].quantile(0.25)
                Q3 = X[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR
                X[col] = X[col].clip(lower_bound, upper_bound)
                self.feature_reasons[col] += 'Outliers handled with Winsorization | '
                self.logger.debug(f"Winsorized '{col}' to bounds ({lower_bound}, {upper_bound})")

            # Update key feature statistics after outlier handling
            if self.debug:
                self.logger.debug(f"Post Outlier Handling '{col}':\n{X[col].describe()}")

        self.preprocessing_steps.append(step_name)
        self.logger.info(f"Completed: {step_name}. Dataset shape: {X.shape}")
        if self.debug:
            self.logger.debug(f"Columns after {step_name}: {X.columns.tolist()}")

        return X

    def choose_transformation(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Choose and apply transformations based on skewness.

        Args:
            X (pd.DataFrame): Input features.

        Returns:
            pd.DataFrame: Transformed features.
        """
        step_name = "Choosing and Applying Transformations"
        self.logger.info(f"Step: {step_name}")
        # Apply PowerTransformer to all numerical features together
        if self.numericals:
            skewed_features = [col for col in self.numericals if abs(X[col].skew()) > 0.75]
            if skewed_features:
                self.transformer = PowerTransformer(method='yeo-johnson')  # Yeo-Johnson handles zero and negative values
                X[self.numericals] = self.transformer.fit_transform(X[self.numericals])
                for col in self.numericals:
                    self.feature_reasons[col] += 'Applied PowerTransformer (Yeo-Johnson) | '
                self.preprocessing_steps.append(step_name)
                self.logger.info(f"Applied PowerTransformer to {len(self.numericals)} numerical features.")
                if self.debug:
                    self.logger.debug(f"DataFrame shape after {step_name}: {X.shape}")
                    self.logger.debug(f"Columns after {step_name}: {X.columns.tolist()}")
            else:
                self.logger.info("No significant skewness detected. No transformations applied.")
        else:
            self.logger.info("No numerical features to transform.")

        return X

    def encode_categorical(self, X_train: pd.DataFrame, X_test: Optional[pd.DataFrame]) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Encode categorical variables using OrdinalEncoder for both ordinal and nominal features.

        Args:
            X_train (pd.DataFrame): Training features.
            X_test (pd.DataFrame): Testing features.

        Returns:
            tuple: Encoded X_train and X_test.
        """
        step_name = "Encoding Categorical Variables"
        self.logger.info(f"Step: {step_name}")

        # Define transformers for ordinal and nominal categorical features
        transformers = []
        if self.ordinal_categoricals:
            transformers.append(
                ('ordinal', OrdinalEncoder(), self.ordinal_categoricals)
            )
        if self.nominal_categoricals:
            # Use separate OrdinalEncoder for nominal features
            nominal_transformer = Pipeline(steps=[
                ('ordinal_encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
            ])
            transformers.append(
                ('nominal', nominal_transformer, self.nominal_categoricals)
            )

        if not transformers:
            self.logger.info("No categorical variables to encode.")
            return X_train, X_test

        # Create ColumnTransformer for encoding
        self.preprocessor = ColumnTransformer(transformers=transformers, remainder='passthrough')

        # Fit and transform training data
        X_train_encoded = self.preprocessor.fit_transform(X_train)
        self.logger.debug("Fitted and transformed X_train with ColumnTransformer.")

        # Transform testing data
        if X_test is not None:
            X_test_encoded = self.preprocessor.transform(X_test)
            self.logger.debug("Transformed X_test with fitted ColumnTransformer.")
        else:
            X_test_encoded = None

        # Retrieve feature names after encoding
        encoded_feature_names = []
        if self.ordinal_categoricals:
            encoded_feature_names += self.ordinal_categoricals
        # Generate encoded names for nominal features; always set the attribute
        # (an empty list when there are none) because later steps read it unconditionally
        self.nominal_categorical_encoded = [f"{col}_encoded" for col in self.nominal_categoricals]
        encoded_feature_names += self.nominal_categorical_encoded

        # Include passthrough (numericals)
        passthrough_features = [col for col in X_train.columns if col not in self.ordinal_categoricals + self.nominal_categoricals]
        encoded_feature_names += passthrough_features

        # Convert numpy arrays back to DataFrames
        X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=encoded_feature_names, index=X_train.index)
        if X_test_encoded is not None:
            X_test_encoded_df = pd.DataFrame(X_test_encoded, columns=encoded_feature_names, index=X_test.index)
        else:
            X_test_encoded_df = None

        # Store encoders for inverse transformation
        self.ordinal_encoder = self.preprocessor.named_transformers_['ordinal'] if 'ordinal' in self.preprocessor.named_transformers_ else None
        self.nominal_encoder = self.preprocessor.named_transformers_['nominal'].named_steps['ordinal_encoder'] if 'nominal' in self.preprocessor.named_transformers_ else None

        # Since we're using OrdinalEncoder, maintain a list of nominal categorical encoded indices
        self.nominal_categorical_indices = [X_train_encoded_df.columns.get_loc(col) for col in self.nominal_categorical_encoded]

        self.preprocessing_steps.append(step_name)
        self.logger.info(f"Completed: Encoding Categorical Variables. X_train_encoded shape: {X_train_encoded_df.shape}")

        if self.debug:
            self.logger.debug(f"DataFrame shape after Encoding Categorical Variables: {X_train_encoded_df.shape}")
            self.logger.debug(f"Columns after Encoding Categorical Variables: {X_train_encoded_df.columns.tolist()}")
            for col in self.ordinal_categoricals:
                self.logger.debug(f"Encoded '{col}' - Sample Values: {X_train_encoded_df[col].dropna().unique()[:5]}")
            for col in self.nominal_categoricals:
                self.logger.debug(f"Encoded '{col}_encoded' - Encoded Values: {X_train_encoded_df[f'{col}_encoded'].dropna().unique()[:5]}")
            # Print encoder categories
            if self.ordinal_encoder:
                self.logger.debug(f"Ordinal Encoder Categories: {self.ordinal_encoder.categories_}")
            if self.nominal_encoder:
                self.logger.debug(f"Nominal Encoder Categories: {self.nominal_encoder.categories_}")

        return X_train_encoded_df, X_test_encoded_df

    def apply_scaling(self, X_train: pd.DataFrame, X_test: Optional[pd.DataFrame]) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
        """
        Apply scaling based on the model type.

        Args:
            X_train (pd.DataFrame): Training features.
            X_test (pd.DataFrame): Testing features.

        Returns:
            tuple: Scaled X_train and X_test
        """
        step_name = "Applying Scaling"
        self.logger.info(f"Step: {step_name}")

        scaler = None
        scaling_type = 'None'

        if self.model_type in ['Logistic Regression', 'Neural Networks']:
            scaler = StandardScaler()
            scaling_type = 'StandardScaler'
        elif self.model_type in ['SVM', 'k-NN', 'Clustering']:
            scaler = MinMaxScaler()
            scaling_type = 'MinMaxScaler'

        if scaler:
            self.scaler = scaler
            X_train[self.numericals] = scaler.fit_transform(X_train[self.numericals])
            if X_test is not None:
                X_test[self.numericals] = scaler.transform(X_test[self.numericals])
            for col in self.numericals:
                self.feature_reasons[col] += f'Scaling Applied: {scaling_type} | '
            self.preprocessing_steps.append(step_name)
            self.logger.info(f"Applied {scaling_type} to numerical features.")

            if self.debug:
                self.logger.debug(f"DataFrame shape after {step_name}: {X_train.shape}")
                self.logger.debug(f"Columns after {step_name}: {X_train.columns.tolist()}")
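                if isinstance(scaler, StandardScaler):
                    self.logger.debug(f"Scaler Parameters: mean={scaler.mean_}, scale={scaler.scale_}")
                else:
                    # MinMaxScaler has no mean_ attribute; log its fitted data range instead
                    self.logger.debug(f"Scaler Parameters: data_min={scaler.data_min_}, data_max={scaler.data_max_}")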
                for col in self.numericals:
                    self.logger.debug(f"Scaled '{col}' - Data Type: {X_train[col].dtype}, Sample Values: {X_train[col].head().values}")
        else:
            self.logger.info("No Scaling Applied as per Model Type.")

        return X_train, X_test
    
    def split_dataset(self, X: pd.DataFrame, y: pd.Series) -> Tuple[pd.DataFrame, Optional[pd.DataFrame], pd.Series, Optional[pd.Series]]:
        """
        Split the dataset into training and testing sets.

        Args:
            X (pd.DataFrame): Features.
            y (pd.Series): Target variable.

        Returns:
            tuple: X_train, X_test, y_train, y_test
        """
        step_name = "Splitting Dataset into Train and Test Sets"
        self.logger.info(f"Step: {step_name}")
        if self.perform_split:
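            # Stratify only for classification targets; a continuous regression target cannot be stratified
            stratify = None if self.model_type == 'Linear Regression' else y
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.2, stratify=stratify, random_state=42
            )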
            self.preprocessing_steps.append(step_name)
            self.logger.info(f"Completed: {step_name}. X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
            if self.debug:
                self.logger.debug(f"Columns in X_train: {X_train.columns.tolist()}")
                self.logger.debug(f"Sample of y_train distribution:\n{y_train.value_counts()}")
            return X_train, X_test, y_train, y_test
        else:
            self.logger.info("Train-Test Split Skipped as perform_split=False")
            return X, None, y, None
        
    def implement_smote(self, X_train: pd.DataFrame, y_train: pd.Series) -> Tuple[pd.DataFrame, pd.Series]:
        """
        Implement SMOTE or SMOTENC based on the presence of categorical features.

        Args:
            X_train (pd.DataFrame): Training features.
            y_train (pd.Series): Training target.

        Returns:
            tuple: Resampled X_train and y_train.
        """
        step_name = "Implementing SMOTE for Class Imbalance"
        self.logger.info(f"Step: {step_name}")

        # Calculate class distribution
        class_counts = y_train.value_counts()
        majority_class = class_counts.idxmax()
        minority_class = class_counts.idxmin()
        majority_count = class_counts.max()
        minority_count = class_counts.min()
        imbalance_ratio = minority_count / majority_count
        self.logger.info(f"Class Distribution before SMOTE: {class_counts.to_dict()}")
        self.logger.info(f"Imbalance Ratio (Minority/Majority): {imbalance_ratio:.4f}")

        # Identify categorical feature indices for SMOTENC
        categorical_features = []

        # For ordinal_categoricals (already encoded as ordinal integers)
        for col in self.ordinal_categoricals:
            try:
                idx = X_train.columns.get_loc(col)
                categorical_features.append(idx)
            except KeyError:
                self.logger.error(f"Categorical feature '{col}' not found in X_train columns.")
                raise

        # For nominal_categoricals (renamed to '<col>_encoded' by encode_categorical, e.g. 'gender_encoded')
        for col in self.nominal_categorical_encoded:
            try:
                idx = X_train.columns.get_loc(col)
                categorical_features.append(idx)
            except KeyError:
                self.logger.error(f"Nominal categorical feature '{col}' not found in X_train columns.")
                raise

        # Initialize SMOTENC
        if categorical_features:
            smote = SMOTENC(categorical_features=categorical_features, random_state=42)
            smote_variant = "SMOTENC"
            reason = "Mixed numerical and categorical features."
        else:
            smote = SMOTE(random_state=42)
            smote_variant = "SMOTE"
            reason = "Numerical features only."

        # Apply SMOTE
        if smote:
            X_res, y_res = smote.fit_resample(X_train, y_train)
            self.smote = smote
            self.preprocessing_steps.append(step_name)
            self.logger.info(f"Selected SMOTE Variant: {smote_variant}")
            self.logger.info(f"Reason for Selection: {reason}")
            self.logger.info(f"SMOTE Completed. Resampled X_train shape: {X_res.shape}, y_train shape: {y_res.shape}")
            if self.debug:
                self.logger.debug(f"Class distribution after SMOTE:\n{pd.Series(y_res).value_counts()}")
            return X_res, y_res
        else:
            self.logger.warning("No suitable SMOTE variant found. Skipping SMOTE.")
            return X_train, y_train

    def preprocessor_recommendations(self) -> pd.DataFrame:
        """
        Generate a table of preprocessing recommendations.

        Returns:
            pd.DataFrame: Recommendations table.
        """
        step_name = "Generating Preprocessor Recommendations"
        self.logger.info(f"Step: {step_name}")
        recommendations_table = pd.DataFrame.from_dict(
            self.feature_reasons, 
            orient='index', 
            columns=['Preprocessing Reason']
        )
        self.logger.debug(f"Preprocessing Recommendations:\n{recommendations_table}")
        self.logger.info(f"Completed: {step_name}")
        self.preprocessing_steps.append(step_name)
        return recommendations_table

    def final_preprocessing(
        self, 
        X: pd.DataFrame, 
        y: pd.Series
    ) -> Tuple[pd.DataFrame, Optional[pd.DataFrame], pd.Series, Optional[pd.Series]]:
        """
        Execute the full preprocessing pipeline in the correct order.

        Args:
            X (pd.DataFrame): Features.
            y (pd.Series): Target variable.

        Returns:
            tuple: X_train, X_test, y_train, y_test
        """
        step_name = "Final Preprocessing Pipeline"
        self.logger.info(f"Starting: {step_name}")

        # Step 1: Handle Missing Values
        X = self.handle_missing_values(X)

        # Step 2: Test for Normality
        normality_results = self.test_normality(X)

        # Step 3: Handle Outliers
        X = self.handle_outliers(X)
        # Outlier removal can drop rows, so keep y aligned with the surviving index
        y = y.loc[X.index]

        # Step 4: Choose and Apply Transformations
        X = self.choose_transformation(X)

        # Step 5: Split Dataset
        X_train, X_test, y_train, y_test = self.split_dataset(X, y)

        # Step 6: Encode Categorical Variables
        if self.perform_split:
            X_train, X_test = self.encode_categorical(X_train, X_test)

            # Step 7: Apply Scaling on Training and Testing Data
            X_train, X_test = self.apply_scaling(X_train, X_test)

            # Step 8: Implement SMOTENC on Training Data Only
            if y_train.value_counts().min() < y_train.value_counts().max():
                X_train, y_train = self.implement_smote(X_train, y_train)

        self.preprocessing_steps.append(step_name)
        self.logger.info(f"Completed: {step_name}")
        return X_train, X_test, y_train, y_test

    def final_inversetransform(self, X_preprocessed: pd.DataFrame, X_original: pd.DataFrame) -> pd.DataFrame:
        """
        Perform inverse transformations to revert preprocessed data back to its original form.

        Args:
            X_preprocessed (pd.DataFrame): Preprocessed features.
            X_original (pd.DataFrame): Original features before preprocessing.

        Returns:
            pd.DataFrame: Inverse-transformed DataFrame.
        """
        step_name = "Inverse Transformation of Preprocessed Data"
        self.logger.info(f"Starting: {step_name}")

        # Debug: Print columns in X_original
        self.logger.debug(f"Columns in X_original during inverse_transform: {X_original.columns.tolist()}")

        # Initialize DataFrame for inverse transformed data
        X_inverse = pd.DataFrame(index=X_preprocessed.index)

        # Inverse Scaling
        if self.scaler:
            try:
                X_inverse[self.numericals] = self.scaler.inverse_transform(X_preprocessed[self.numericals])
                self.logger.debug("Inverse Scaling Completed")
                for col in self.numericals:
                    self.feature_reasons[col] += 'Inverse Scaling Applied | '
            except Exception as e:
                self.logger.error(f"Error during inverse Scaling: {e}")
                raise
        else:
            X_inverse[self.numericals] = X_preprocessed[self.numericals]

        # Inverse Transformation (PowerTransformer)
        if self.transformer:
            try:
                X_inverse[self.numericals] = self.transformer.inverse_transform(X_inverse[self.numericals])
                self.logger.debug("Inverse Transformation Completed")
                for col in self.numericals:
                    self.feature_reasons[col] += 'Inverse Transformation Applied | '
            except Exception as e:
                self.logger.error(f"Error during inverse Transformation: {e}")
                raise

        # Inverse Encoding for Ordinal Categorical Features
        if self.ordinal_categoricals and self.ordinal_encoder:
            try:
                X_inverse[self.ordinal_categoricals] = self.ordinal_encoder.inverse_transform(X_preprocessed[self.ordinal_categoricals])
                self.logger.debug("Inverse Ordinal Encoding for Ordinal Categorical Features Completed")
                for col in self.ordinal_categoricals:
                    self.feature_reasons[col] += 'Inverse Ordinal Encoding Applied | '
            except Exception as e:
                self.logger.error(f"Error during inverse Ordinal Encoding for Ordinal Categorical Features: {e}")
                raise

        # Inverse Encoding for Nominal Categorical Features
        if self.nominal_categoricals and self.nominal_encoder:
            try:
                # Inverse transform all nominal categorical features together
                X_inverse[self.nominal_categoricals] = self.nominal_encoder.inverse_transform(X_preprocessed[self.nominal_categorical_encoded])
                self.logger.debug("Inverse Ordinal Encoding for Nominal Categorical Features Completed")
                for col in self.nominal_categoricals:
                    self.feature_reasons[col] += 'Inverse Ordinal Encoding Applied | '
            except Exception as e:
                self.logger.error(f"Error during inverse Ordinal Encoding for Nominal Categorical Features: {e}")
                raise

        # Combine all features
        try:
            # Identify columns that were transformed
            transformed_cols = self.numericals + self.ordinal_categoricals + self.nominal_categoricals

            # Include passthrough (non-transformed) features
            # Since all features are transformed, no passthrough columns should exist
            non_transformed_cols = [col for col in X_original.columns if col not in transformed_cols]

            if non_transformed_cols:
                X_inverse = pd.concat([X_inverse, X_preprocessed[non_transformed_cols]], axis=1)

            # Reorder columns to match the original DataFrame
            X_final_inverse = X_inverse[X_original.columns]
        except Exception as e:
            self.logger.error(f"Error during combining inverse transformed data: {e}")
            raise

        self.preprocessing_steps.append(step_name)
        self.logger.info(f"Completed: {step_name}")

        return X_final_inverse



# main_preprocessing.py

import pandas as pd
import numpy as np
import pickle
from typing import List, Dict, Optional, Tuple
import logging
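import os

# NOTE: manage_features, load_selected_features_data, and DataPreprocessor are defined
# elsewhere in this project and must be imported here before main() can run.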


def validate_inverse(original_df: pd.DataFrame, inverse_df: pd.DataFrame, numericals: list, categorical_features: list, tolerance: float = 1e-4):
    """
    Validate the inverse transformation by comparing original and inverse-transformed data.

    Args:
        original_df (pd.DataFrame): Original DataFrame before preprocessing.
        inverse_df (pd.DataFrame): Inverse-transformed DataFrame.
        numericals (list): List of numerical feature names.
        categorical_features (list): List of categorical feature names.
        tolerance (float): Tolerance for numerical differences.
    """
    logger = logging.getLogger('Validation')
    differences = {}

    for col in categorical_features:
        diff = original_df[col].astype(str) != inverse_df[col].astype(str)
        differences[col] = {
            'total_differences': diff.sum(),
            'percentage_differences': (diff.sum() / len(diff)) * 100
        }

    for col in numericals:
        diff = np.abs(original_df[col] - inverse_df[col]) > tolerance
        differences[col] = {
            'total_differences': diff.sum(),
            'percentage_differences': (diff.sum() / len(diff)) * 100
        }

    # Display the differences
    for col, stats in differences.items():
        print(f"Column: {col}")
        print(f" - Total Differences: {stats['total_differences']}")
        print(f" - Percentage Differences: {stats['percentage_differences']:.2f}%\n")

    # Detailed differences
    for col in differences:
        if differences[col]['total_differences'] > 0:
            print(f"Differences found in column '{col}':")
            mask = (original_df[col].astype(str) != inverse_df[col].astype(str)) if col in categorical_features else (np.abs(original_df[col] - inverse_df[col]) > tolerance)
            comparison = pd.concat([
                original_df.loc[mask, col].reset_index(drop=True).rename('Original'),
                inverse_df.loc[mask, col].reset_index(drop=True).rename('Inverse Transformed')
            ], axis=1)
            print(comparison)
            print("\n")

    # Check if indices are aligned
    if not original_df.index.equals(inverse_df.index):
        print("Warning: Indices of original and inverse transformed data do not match.")
    else:
        print("Success: Indices of original and inverse transformed data are aligned.")


def main():
    # Define the path to the single pickle file containing all metadata
    save_path = '../../data/model/pipeline/features_metadata.pkl'  # Adjust as needed
    dataset_csv_path = '../../ml-preprocessing-utils/data/dataset/test/test_ml_dataset.csv'  # Ensure this path exists

    # Define a debug flag based on user preference
    debug_flag = True  # Set to False for minimal outputs

    # Configure root logger
    logging.basicConfig(
        level=logging.DEBUG if debug_flag else logging.INFO, 
        format='%(asctime)s [%(levelname)s] %(message)s'
    )
    logger = logging.getLogger('main_preprocessing')


    # **Loading Process:**
    # Load features and metadata using manage_features
    loaded = manage_features(
        mode='load',
        save_path=save_path
    )

    # Access loaded data
    if loaded:
        features = loaded.get('features')
        ordinals = loaded.get('ordinal_categoricals')
        nominals = loaded.get('nominal_categoricals')
        nums = loaded.get('numericals')
        y_var = loaded.get('y_variable')
        loaded_dataset_path = loaded.get('dataset_csv_path')  # Correct key

        print("\n📥 Loaded Data:")
        print("Features:", features)
        print("Ordinal Categoricals:", ordinals)
        print("Nominal Categoricals:", nominals)
        print("Numericals:", nums)
        print("Y Variable:", y_var)
        print("Dataset Path:", loaded_dataset_path)

    else:
        logger.error("Failed to load features and metadata.")
        return  # Exit the main function if loading fails

    # Load the selected features data using the loaded dataset path and metadata
    try:
        final_ml_df_selected_features, column_assets = load_selected_features_data(
            loaded_data=loaded,  # Pass the entire loaded data dictionary
            debug=debug_flag
        )
    except Exception as e:
        logger.error(f"Failed to load selected features data: {e}")
        return  # Exit if data loading fails
    
    # Initialize the DataPreprocessor
    preprocessor = DataPreprocessor(
        model_type='Logistic Regression',
        column_assets=column_assets,
        perform_split=True,
        debug=debug_flag
    )
    
    # Generate and display preprocessing recommendations
    recommendations = preprocessor.preprocessor_recommendations()
    print("\nPreprocessing Recommendations:")
    print(recommendations)
    
    # Execute the final preprocessing
    # Use y_var (loaded above); the name y_variable is never defined in this scope
    X = final_ml_df_selected_features.drop(y_var, axis=1)
    y = final_ml_df_selected_features[y_var]
    X_train, X_test, y_train, y_test = preprocessor.final_preprocessing(X, y)


    # Display the shapes of the preprocessed datasets
    print("\nPreprocessed Datasets Shapes:")
    print(f"X_train: {X_train.shape}")
    print(f"X_test: {X_test.shape}")
    print(f"y_train: {y_train.shape}")
    print(f"y_test: {y_test.shape}")

    # Round-trip check: inverse-transform the test set and compare it with the original data
    if debug_flag:
        print("\nPerforming Inverse Transformation on Test Set Samples...")
        try:
            # Inverse Transform Only Test Set
            inverse_transformed_test = preprocessor.final_inversetransform(X_test, X.loc[X_test.index])

            print("\nInverse Transformed X_test:")
            print(inverse_transformed_test.head())

            # Compare inverse transformed test data with original test data
            original_test_data = X.loc[X_test.index].copy()
            inverse_transformed_subset = inverse_transformed_test.copy()

            # Tolerance for numerical differences in the per-column comparison
            tolerance = 1e-4

            validate_inverse(
                original_df=original_test_data,
                inverse_df=inverse_transformed_subset,
                numericals=preprocessor.numericals,
                categorical_features=preprocessor.ordinal_categoricals + preprocessor.nominal_categoricals,
                tolerance=tolerance
            )

        except AttributeError as ae:
            print(f"\nInverse Transformation Failed: {ae}")
        except Exception as e:
            print(f"\nAn unexpected error occurred during inverse transformation: {e}")

if __name__ == "__main__":
    main()

