# Basic KMeans use for machine learning


Automating preprocessing and the choice among three clustering algorithms (**KMeans**, **KModes**, and **KPrototypes**) requires a systematic approach: inspect the dataset's characteristics, then apply the preprocessing each algorithm needs. The sections below describe that automation, the dataset requirements of each algorithm, and a decision path for choosing among them.

## Overview

1. **Dataset Analysis**: Determine the types of features (numerical, categorical, or mixed) and assess data quality (missing values, outliers, scaling).
2. **Preprocessing Steps**: Apply transformations based on feature types and data quality.
3. **Algorithm Selection**: Choose the appropriate clustering algorithm based on feature types.
4. **Model Execution and Evaluation**: Apply the selected algorithm and evaluate clustering performance.

## Step-by-Step Automation Process

### 1. Dataset Analysis

**Objective**: Identify the nature of the dataset to determine suitable preprocessing steps and the appropriate clustering algorithm.

**Actions**:

- **Identify Feature Types**:
  - **Numerical Features**: Continuous or discrete numerical values.
  - **Categorical Features**: Nominal or ordinal data.
  - **Mixed Features**: A combination of numerical and categorical features.

- **Assess Data Quality**:
  - **Missing Values**: Check for any null or missing entries.
  - **Outliers**: Detect and evaluate outliers in numerical data.
  - **Scaling**: Determine if numerical features are on different scales.
  - **Dimensionality**: Evaluate the number of features and their correlations.

**Implementation**:

Use libraries like `pandas` and `numpy` for data inspection.

```python
import pandas as pd
import numpy as np

def analyze_dataset(df):
    feature_types = {}
    for column in df.columns:
        if pd.api.types.is_numeric_dtype(df[column]):
            feature_types[column] = 'numerical'
        else:
            feature_types[column] = 'categorical'
    
    num_features = [col for col, typ in feature_types.items() if typ == 'numerical']
    cat_features = [col for col, typ in feature_types.items() if typ == 'categorical']
    
    missing_values = df.isnull().sum()
    outliers = {}
    for col in num_features:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        outliers[col] = df[(df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))].shape[0]
    
    correlations = df[num_features].corr().abs()
    upper_tri = correlations.where(np.triu(np.ones(correlations.shape), k=1).astype(bool))
    highly_correlated = [column for column in upper_tri.columns if any(upper_tri[column] > 0.9)]
    
    return {
        'feature_types': feature_types,
        'num_features': num_features,
        'cat_features': cat_features,
        'missing_values': missing_values,
        'outliers': outliers,
        'highly_correlated': highly_correlated
    }
```
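To make the 1.5 × IQR rule used in `analyze_dataset` concrete, here is a small worked sketch on a made-up series (the values are illustrative, not taken from any real dataset). Anything outside the interval from `Q1 - 1.5 * IQR` to `Q3 + 1.5 * IQR` is counted as an outlier.

```python
import pandas as pd

# Hypothetical measurements with one obvious extreme value
values = pd.Series([10, 12, 11, 13, 12, 95])

q1 = values.quantile(0.25)   # 11.25 (pandas interpolates linearly by default)
q3 = values.quantile(0.75)   # 12.75
iqr = q3 - q1                # 1.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 9.0 and 15.0

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())     # [95]
```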

### 2. Preprocessing Steps

**Objective**: Clean and prepare data based on its characteristics to ensure optimal performance of the clustering algorithms.

**Actions**:

- **Handling Missing Data**:
  - **Numerical Features**: Impute with mean or median.
  - **Categorical Features**: Impute with mode or a new category (e.g., 'Unknown').

- **Removing Outliers**:
  - Apply techniques like IQR filtering or Z-score thresholding for numerical data.

- **Scaling Numerical Features**:
  - Standardize (zero mean, unit variance) or normalize (range [0,1]) to ensure all features contribute equally.

- **Encoding Categorical Features**:
  - **KModes/KPrototypes**: Typically, these algorithms handle categorical data directly, so encoding is not mandatory. However, ensure consistent data formats.
  - **KPrototypes**: Requires specifying categorical feature indices.

- **Dimensionality Reduction**:
  - Apply PCA for numerical data to reduce dimensions if necessary.

**Implementation**:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA

def preprocess_data(df, analysis_results):
    df_clean = df.copy()
    num_features = analysis_results['num_features']
    cat_features = analysis_results['cat_features']

    # Handling Missing Values
    if num_features:
        num_imputer = SimpleImputer(strategy='median')
        df_clean[num_features] = num_imputer.fit_transform(df_clean[num_features])

    if cat_features:
        cat_imputer = SimpleImputer(strategy='most_frequent')
        df_clean[cat_features] = cat_imputer.fit_transform(df_clean[cat_features])

    # Removing Outliers (drops rows; the original index is kept so labels can be mapped back)
    for col in num_features:
        Q1 = df_clean[col].quantile(0.25)
        Q3 = df_clean[col].quantile(0.75)
        IQR = Q3 - Q1
        in_range = df_clean[col].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
        df_clean = df_clean[in_range].copy()

    # Scaling Numerical Features (skipped for purely categorical data)
    if num_features:
        scaler = StandardScaler()
        df_clean[num_features] = scaler.fit_transform(df_clean[num_features])

    # Dimensionality Reduction (optional): swap the numerical columns for 10 principal components
    if len(num_features) > 10:
        pca = PCA(n_components=10)
        components = pca.fit_transform(df_clean[num_features])
        pca_cols = [f'pc_{i + 1}' for i in range(components.shape[1])]
        df_pca = pd.DataFrame(components, columns=pca_cols, index=df_clean.index)
        df_clean = pd.concat([df_clean.drop(columns=num_features), df_pca], axis=1)
        # Keep the feature list in sync for the downstream steps
        analysis_results['num_features'] = pca_cols

    return df_clean
```
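The two scaling options listed in the actions above can be written out by hand. The sketch below uses a made-up two-column matrix and plain NumPy; it mirrors what `StandardScaler` and `MinMaxScaler` compute with their default settings, but it is only an illustration of the formulas, not a replacement for those classes.

```python
import numpy as np

# Toy feature matrix: age (years) and income on very different scales
X = np.array([[25, 50_000], [30, 60_000], [22, 45_000], [35, 80_000]], dtype=float)

# Standardization: subtract the column mean, divide by the column standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-max normalization: rescale each column to the [0, 1] range
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

Without either step, the income column would dominate any Euclidean distance simply because its values are several orders of magnitude larger than the ages.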

### 3. Algorithm Selection

**Objective**: Automatically choose the most suitable clustering algorithm based on the dataset's feature types.

**Decision Criteria**:

- **All Numerical Features**:
  - **Algorithm**: KMeans
  - **Reason**: KMeans minimizes within-cluster squared Euclidean distance, which is only meaningful for numerical features.

- **All Categorical Features**:
  - **Algorithm**: KModes
  - **Reason**: KModes replaces cluster means with modes and measures dissimilarity by counting mismatched categories, so it works directly on categorical data.

- **Mixed Numerical and Categorical Features**:
  - **Algorithm**: KPrototypes
  - **Reason**: KPrototypes combines KMeans and KModes to handle mixed data types.

**Implementation**:

```python
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes
from sklearn.cluster import KMeans

def select_clustering_algorithm(analysis_results):
    if analysis_results['num_features'] and not analysis_results['cat_features']:
        return 'KMeans'
    elif analysis_results['cat_features'] and not analysis_results['num_features']:
        return 'KModes'
    elif analysis_results['num_features'] and analysis_results['cat_features']:
        return 'KPrototypes'
    else:
        raise ValueError("Dataset must contain at least one numerical or categorical feature.")
```
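The KModes criterion mentioned in the decision criteria can be illustrated without any library. The following is a minimal pure-Python sketch of the simple matching dissimilarity and of a mode-based centroid; the helper names `matching_dissimilarity` and `column_modes` are hypothetical and are not part of the `kmodes` package.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Return the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

# Made-up records: (gender, marital status, education level)
cluster = [
    ("M", "Single", "Bachelor"),
    ("M", "Married", "Bachelor"),
    ("F", "Single", "Bachelor"),
]

centroid = column_modes(cluster)
cost = sum(matching_dissimilarity(record, centroid) for record in cluster)
print(centroid, cost)
```

The centroid is built attribute by attribute from the modes, and the cluster cost is the total number of mismatches against it. This is the quantity KModes tries to keep small.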

### 4. Model Execution and Evaluation

**Objective**: Apply the selected clustering algorithm and evaluate its performance.

**Actions**:

- **Determine Optimal Number of Clusters (K)**:
  - Use methods such as the elbow method, silhouette analysis, or the gap statistic.

- **Fit the Model**:
  - Apply the chosen algorithm with the optimal K.

- **Evaluate Clustering**:
  - Use metrics such as Silhouette Score, Davies-Bouldin Index, or domain-specific evaluations.

**Implementation**:

```python
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

def determine_optimal_k_kmeans(df, max_k=10):
    cost = []
    K = range(1, min(max_k, len(df) - 1) + 1)  # k must stay below the number of rows
    for k in K:
        kmeans = KMeans(n_clusters=k, random_state=0)
        kmeans.fit(df)
        cost.append(kmeans.inertia_)
    plt.plot(K, cost, 'bx-')
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('Inertia')
    plt.title('Elbow Method For Optimal k')
    plt.show()
    # Choose k where the elbow appears
    # This step can be automated using derivative methods or left for manual inspection
    return 3  # Placeholder

def determine_optimal_k_kmodes(df, max_k=10):
    cost = []
    K = range(1, min(max_k, len(df) - 1) + 1)  # k must stay below the number of rows
    for k in K:
        kmodes = KModes(n_clusters=k, init='Huang', n_init=5, verbose=0)
        kmodes.fit_predict(df)
        cost.append(kmodes.cost_)
    plt.plot(K, cost, 'bx-')
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('Cost')
    plt.title('Elbow Method For Optimal k (KModes)')
    plt.show()
    return 3  # Placeholder

def determine_optimal_k_kprototypes(df, categorical_indices, max_k=10):
    cost = []
    K = range(1, min(max_k, len(df) - 1) + 1)  # k must stay below the number of rows
    for k in K:
        kproto = KPrototypes(n_clusters=k, init='Cao', verbose=0)
        kproto.fit_predict(df, categorical=categorical_indices)
        cost.append(kproto.cost_)
    plt.plot(K, cost, 'bx-')
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('Cost')
    plt.title('Elbow Method For Optimal k (KPrototypes)')
    plt.show()
    return 3  # Placeholder

def execute_clustering(df, analysis_results, algorithm):
    if algorithm == 'KMeans':
        optimal_k = determine_optimal_k_kmeans(df[analysis_results['num_features']])
        model = KMeans(n_clusters=optimal_k, random_state=0)
        model.fit(df[analysis_results['num_features']])
        labels = model.labels_
        score = silhouette_score(df[analysis_results['num_features']], labels)
        return model, labels, score
    
    elif algorithm == 'KModes':
        optimal_k = determine_optimal_k_kmodes(df[analysis_results['cat_features']])
        kmodes = KModes(n_clusters=optimal_k, init='Huang', n_init=5, verbose=0)
        labels = kmodes.fit_predict(df[analysis_results['cat_features']])
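        # One-hot encode, then use Hamming distance so the score reflects category mismatches
        encoded = pd.get_dummies(df[analysis_results['cat_features']]).astype(int)
        score = silhouette_score(encoded, labels, metric='hamming')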
        return kmodes, labels, score
    
    elif algorithm == 'KPrototypes':
        categorical_indices = [df.columns.get_loc(col) for col in analysis_results['cat_features']]
        optimal_k = determine_optimal_k_kprototypes(df, categorical_indices)
        kproto = KPrototypes(n_clusters=optimal_k, init='Cao', verbose=0)
        labels = kproto.fit_predict(df, categorical=categorical_indices)
        # For mixed data, silhouette score can be approximated using a custom distance
        # Here, using only numerical features for score
        score = silhouette_score(df[analysis_results['num_features']], labels)
        return kproto, labels, score
    
    else:
        raise ValueError("Unsupported algorithm selected.")
```
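The code above reports only the silhouette score. As a sketch of how the Davies-Bouldin index named in the actions could be computed alongside it, the example below clusters two synthetic, well-separated blobs (generated data, chosen purely for illustration) and evaluates both metrics with scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs (illustrative data only)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)       # in [-1, 1]; higher is better
dbi = davies_bouldin_score(X, labels)   # >= 0; lower is better
print(f"silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```

Both functions expect numerical input, so for KModes or KPrototypes results they would need an encoded representation or a precomputed distance matrix.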

### 5. Automation Workflow

**Objective**: Integrate all steps into a cohesive automated pipeline.

**Implementation**:

```python
def automated_clustering_pipeline(df):
    # Step 1: Analyze Dataset
    analysis_results = analyze_dataset(df)
    
    # Step 2: Preprocess Data
    df_clean = preprocess_data(df, analysis_results)
    
    # Step 3: Select Clustering Algorithm
    algorithm = select_clustering_algorithm(analysis_results)
    print(f"Selected Clustering Algorithm: {algorithm}")
    
    # Step 4: Execute Clustering
    model, labels, score = execute_clustering(df_clean, analysis_results, algorithm)
    print(f"Clustering Silhouette Score: {score}")
    
    # Attach labels by index: preprocessing may have dropped outlier rows
    result = df.loc[df_clean.index].copy()
    result['Cluster'] = labels
    return result, model, score
```

**Usage Example**:

```python
# Sample DataFrame
data = {
    'age': [25, 30, 22, 35, 28, 40, 50, 23],
    'income': [50000, 60000, 45000, 80000, 52000, 90000, 120000, 48000],
    'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'M', 'F'],
    'marital_status': ['Single', 'Married', 'Single', 'Married', 'Single', 'Married', 'Married', 'Single']
}
df = pd.DataFrame(data)

# Run the automated pipeline
clustered_df, model, silhouette = automated_clustering_pipeline(df)
print(clustered_df)
```

### 6. Defined Path for Algorithm Usage

Based on the dataset analysis and preprocessing, the following path determines which clustering algorithm to use:

1. **All Numerical Features**:
   - **Use KMeans**
   - **Preprocessing**: Handle missing values, remove outliers, scale features.
   - **Example Scenarios**: Customer segmentation based on transaction amounts, clustering based on physical measurements.

2. **All Categorical Features**:
   - **Use KModes**
   - **Preprocessing**: Handle missing values, ensure consistent encoding.
   - **Example Scenarios**: Market segmentation based on categorical demographics, clustering survey responses.

3. **Mixed Numerical and Categorical Features**:
   - **Use KPrototypes**
   - **Preprocessing**: Handle missing values, remove outliers in numerical data, scale numerical features, specify categorical feature indices.
   - **Example Scenarios**: Customer segmentation combining demographics (categorical) and purchase behavior (numerical), clustering products with attributes of different types.

### 7. Additional Considerations

- **Optimal K Selection**:
  - Automating the detection of the "elbow" in the cost plot can be enhanced using methods like the **Kneedle algorithm** to programmatically determine the optimal number of clusters.
  
- **Handling High Dimensionality**:
  - For datasets with a large number of features, consider dimensionality reduction techniques (e.g., PCA for numerical data) before clustering to improve performance and reduce noise.

- **Scalability**:
  - For large datasets, consider algorithm scalability. KMeans is generally faster, while KModes and KPrototypes may require optimization or sampling.

- **Evaluation Metrics**:
  - **Silhouette Score**: Measures how similar an object is to its own cluster compared to other clusters; it ranges from -1 to 1, and higher is better.
  - **Davies-Bouldin Index**: Averages, over all clusters, the similarity between each cluster and the cluster most similar to it; lower is better.
  - **Domain-Specific Metrics**: Depending on the application, custom metrics may be more appropriate.
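For the optimal-K point above, the placeholder `return 3` in the elbow functions could be replaced by a programmatic choice. The sketch below is a simplified geometric heuristic in the spirit of Kneedle, not the Kneedle algorithm itself: it picks the `k` whose point on the normalized cost curve lies farthest from the straight line joining the first and last points. The function name `find_elbow` and the inertia values are made up for illustration, and the function assumes at least two distinct `k` values and a non-constant cost curve.

```python
import numpy as np

def find_elbow(ks, costs):
    """Return the k whose (k, cost) point is farthest from the chord
    joining the first and last points of the curve."""
    ks = np.asarray(ks, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Normalize both axes to [0, 1] so neither dominates the distance
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (costs - costs.min()) / (costs.max() - costs.min())
    # Perpendicular distance from each point to the line through the endpoints
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dy * (x - x[0]) - dx * (y - y[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])

ks = [1, 2, 3, 4, 5, 6]
inertia = [1000, 400, 150, 120, 100, 90]   # made-up elbow curve
print(find_elbow(ks, inertia))
```

On noisy or nearly linear cost curves this heuristic can pick an arbitrary point, so a visual check of the plot is still worthwhile.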

## Example Scenario Breakdown

### Scenario 1: All Numerical Data

**Dataset**:
- Features: Age, Income, Purchase Frequency

**Preprocessing**:
- Impute missing values with median.
- Remove outliers using IQR.
- Scale features using StandardScaler.

**Algorithm**:
- **KMeans**

**Path**:
- Numerical → KMeans

### Scenario 2: All Categorical Data

**Dataset**:
- Features: Gender, Marital Status, Education Level

**Preprocessing**:
- Impute missing values with mode.
- Ensure categorical consistency (e.g., standardized category labels).

**Algorithm**:
- **KModes**

**Path**:
- Categorical → KModes

### Scenario 3: Mixed Data

**Dataset**:
- Features: Age, Income, Gender, Marital Status

**Preprocessing**:
- Impute missing values appropriately.
- Remove outliers from numerical features.
- Scale numerical features.
- Identify categorical feature indices.

**Algorithm**:
- **KPrototypes**

**Path**:
- Mixed → KPrototypes

## Conclusion

Automating the preprocessing and selection of clustering algorithms involves a structured approach: analyze the dataset, apply the appropriate preprocessing steps, and select the clustering method that matches the feature types. By following the outlined steps and adapting the provided code snippets, you can build a pipeline that adjusts its preprocessing and algorithm choice to the dataset at hand.

## References

- **KModes Documentation**: [kmodes GitHub](https://github.com/nicodv/kmodes)
- **KPrototypes Documentation**: [kmodes GitHub](https://github.com/nicodv/kmodes)
- **Scikit-learn Clustering**: [Scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)
- **Silhouette Score**: [Silhouette Analysis](https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient)
- **Kneedle Algorithm for Elbow Detection**: [kneed GitHub](https://github.com/arvkevi/kneed)

Clustering-specific preprocessing rules can also be integrated into a broader automated preprocessing pipeline, so that each clustering algorithm (**KMeans**, **KModes**, **KPrototypes**) is handled according to the dataset's characteristics. The guide below shows how to extend an existing preprocessing pipeline to accommodate clustering models without affecting other model types.

## Overview of Integration

1. **Extend the Preprocessor Configuration**: Incorporate clustering model types and their specific requirements into the preprocessor's configuration.
2. **Conditional Preprocessing Steps**: Apply different preprocessing rules based on the selected model type, especially for clustering algorithms.
3. **Maintain Separation**: Ensure that clustering-specific preprocessing does not interfere with preprocessing steps for other model types.

## Updated Preprocessing Pipeline

Below is an enhanced version of your preprocessing pipeline that integrates clustering-specific rules. We'll modify the `DataPreprocessor` class to handle clustering models appropriately.

### 1. Initialize Preprocessor and Configure Options

**Goal**: Set up the preprocessing pipeline with necessary configurations, including model type, whether to perform a train-test split, and other preprocessing options.

**Implementation Example**:

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler, PowerTransformer,
    OrdinalEncoder, OneHotEncoder,
)
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from kmodes.kmodes import KModes
from kmodes.kprototypes import KPrototypes
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTENC
from scipy.stats import shapiro, skew
from sklearn.ensemble import IsolationForest
from scipy.stats import probplot
import matplotlib.pyplot as plt
import pickle

class DataPreprocessor:
    def __init__(self, model_type, perform_split=True, split_ratio=0.2, random_state=42, preprocessing_options=None):
        self.model_type = model_type.lower()
        self.perform_split = perform_split
        self.split_ratio = split_ratio
        self.random_state = random_state
        self.preprocessing_options = preprocessing_options or {}
        self.fitted = False  # Flag to check if preprocessors are fitted
        
        # Initialize placeholders for transformers
        self.imputers = {}
        self.transformers = {}
        self.encoders = {}
        self.scalers = {}
        self.smote = None  # SMOTE instance if applied
        
        # Define model-specific preprocessing requirements
        self.model_requirements = self._define_model_requirements()
        
    def _define_model_requirements(self):
        # Define preprocessing recommendations based on model type
        requirements = {
            'kmeans': {
                'feature_types': 'numerical',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': 'standard',
                'encoding': None,
                'outlier_handling': 'isolationforest',  # matches what handle_outliers applies
                'transformation': 'yeo-johnson'
            },
            'kmodes': {
                'feature_types': 'categorical',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': None,
                'encoding': 'ordinal_nominal',  # Typically not needed, but ensuring consistency
                'outlier_handling': None,  # no numerical outliers in purely categorical data
                'transformation': None
            },
            'kprototypes': {
                'feature_types': 'mixed',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': 'standard',  # Scale numerical features
                'encoding': 'ordinal_nominal',
                'outlier_handling': 'isolationforest',
                'transformation': 'yeo-johnson'
            },
            # Add other model types as needed
            # ...
        }
        return requirements
    
    # Other methods will be defined below
```

### 2. Split Dataset into Train/Test and X/y

**Goal**: Divide the dataset into training and testing subsets to ensure unbiased model evaluation and prevent data leakage.

**Modification for Clustering**:

- **Clustering Models**: Clustering is unsupervised, so no target variable `y` is required. Clustering algorithms are typically fit on the entire dataset or on a sample; a split remains available for consistency with the other model types.

**Implementation Example**:

```python
    def split_dataset(self, X, y=None):
        if not self.perform_split:
            return X, X, y, y  # No split: train and test refer to the same data
        
        if self.model_type in ['classification', 'logistic regression']:
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=self.split_ratio, stratify=y, random_state=self.random_state
            )
        elif self.model_type in ['regression', 'linear regression']:
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=self.split_ratio, random_state=self.random_state
            )
        elif self.model_type in ['time_series']:
            split_index = int(len(X) * (1 - self.split_ratio))
            X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
            y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]
        elif self.model_type in ['clustering', 'kmeans', 'kmodes', 'kprototypes']:
            # For clustering, y is not used
            X_train, X_test = train_test_split(
                X, test_size=self.split_ratio, random_state=self.random_state
            )
            y_train, y_test = None, None
        else:
            raise ValueError("Unsupported model type for splitting.")
        
        return X_train, X_test, y_train, y_test
```
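If a split is used for a clustering model, one possible use is to fit on the training rows and then assign the held-out rows to the learned clusters. The sketch below shows this for `KMeans` on synthetic data; the data and parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic data: two blobs of 60 points each
X = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(4, 0.5, (60, 2))])

# No y and no stratification: clustering has no target to preserve
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)
test_labels = km.predict(X_test)   # assign held-out rows to the nearest learned centroid
print(X_train.shape, X_test.shape, np.bincount(test_labels))
```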

### 3. Handle Missing Values

**Goal**: Impute missing values appropriately to maintain data integrity.

**Modification for Clustering**:

- **Clustering Models**: Depending on the clustering algorithm:
  - **KMeans & KPrototypes**: Handle numerical and categorical missing values as per their requirements.
  - **KModes**: Focus on imputing categorical missing values.

**Implementation Example**:

```python
    def handle_missing_values(self, X_train, X_test, y_train=None):
        numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
        categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Get imputation strategies from model requirements
        impute_config = self.model_requirements.get(self.model_type, {}).get('imputation', {})
        num_strategy = impute_config.get('numerical', 'median')
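        cat_strategy = impute_config.get('categorical', 'mode')
        # SimpleImputer has no 'mode' strategy; its equivalent is 'most_frequent'
        if cat_strategy == 'mode':
            cat_strategy = 'most_frequent'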
        
        # Handle time_series separately if needed
        if self.model_type == 'time_series':
            # Numerical features: Interpolation
            X_train[numerical_features] = X_train[numerical_features].interpolate(method='linear')
            X_test[numerical_features] = X_test[numerical_features].interpolate(method='linear')
            
            # Categorical features: Fill with 'Missing'
            X_train[categorical_features] = X_train[categorical_features].fillna('Missing')
            X_test[categorical_features] = X_test[categorical_features].fillna('Missing')
            return X_train, X_test
        
        # Numerical Imputer
        if numerical_features:
            num_imputer = SimpleImputer(strategy=num_strategy)
            num_imputer.fit(X_train[numerical_features])
            X_train_num = num_imputer.transform(X_train[numerical_features])
            X_test_num = num_imputer.transform(X_test[numerical_features])
            self.imputers['numerical'] = num_imputer
            
            # Reconstruct numerical DataFrames
            X_train[numerical_features] = X_train_num
            X_test[numerical_features] = X_test_num
        
        # Categorical Imputer
        if categorical_features:
            if cat_strategy == 'constant_missing':
                cat_imputer = SimpleImputer(strategy='constant', fill_value='Missing')
            else:
                cat_imputer = SimpleImputer(strategy=cat_strategy)
            cat_imputer.fit(X_train[categorical_features])
            X_train_cat = cat_imputer.transform(X_train[categorical_features])
            X_test_cat = cat_imputer.transform(X_test[categorical_features])
            self.imputers['categorical'] = cat_imputer
            
            # Reconstruct categorical DataFrames
            X_train[categorical_features] = X_train_cat
            X_test[categorical_features] = X_test_cat
        
        return X_train, X_test
```
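The key idea in the method above is that imputers learn their fill values from the training rows only and then reuse them on the test rows. A minimal standalone sketch with made-up data follows; the column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({
    "income": [40_000.0, 50_000.0, np.nan, 90_000.0],
    "gender": ["F", "F", "M", np.nan],
})
test = pd.DataFrame({
    "income": [np.nan, 70_000.0],
    "gender": ["M", np.nan],
})

# Fit on train only, so no information from the test rows leaks into the fill values
num_imputer = SimpleImputer(strategy="median").fit(train[["income"]])
cat_imputer = SimpleImputer(strategy="most_frequent").fit(train[["gender"]])

test_filled = test.copy()
test_filled["income"] = num_imputer.transform(test[["income"]]).ravel()
test_filled["gender"] = cat_imputer.transform(test[["gender"]]).ravel()
print(test_filled)
```

Here the missing test income is filled with the training median and the missing test gender with the most frequent training category.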

### 4. Test for Normality

**Goal**: Determine if feature distributions meet model assumptions regarding normality.

**Modification for Clustering**:

- **KMeans & KPrototypes**: Benefit from normally distributed numerical features but do not strictly require it. Skewness can affect clustering performance.
- **KModes**: Does not require normality as it deals with categorical data.

**Implementation Example**:

```python
    def test_normality(self, X_train):
        if self.model_type in ['time_series', 'kmodes']:
            # Normality not a primary concern for time_series and KModes
            return []
        
        p_value_threshold = self.preprocessing_options.get('p_value_threshold', 0.05)
        skewness_threshold = self.preprocessing_options.get('skewness_threshold', 1.0)
        features_to_transform = []
        
        numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
        
        for col in numerical_features:
            data = X_train[col].dropna()
            if self.model_type in ['linear regression', 'logistic regression',
                                   'linear_regression', 'logistic_regression']:
                stat, p_val = shapiro(data)
                col_skew = skew(data)
                if p_val < p_value_threshold or abs(col_skew) > skewness_threshold:
                    features_to_transform.append(col)
            elif self.model_type in ['neural_networks', 'svm', 'k_nn', 'kmeans', 'kprototypes']:
                col_skew = skew(data)
                if abs(col_skew) > skewness_threshold:
                    features_to_transform.append(col)
            else:
                # Default behavior
                col_skew = skew(data)
                if abs(col_skew) > skewness_threshold:
                    features_to_transform.append(col)
        
        return features_to_transform
```
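The skewness check that `test_normality` applies to the clustering models can be tried in isolation. The sketch below compares a symmetric and a right-skewed synthetic sample against the default threshold of 1.0; the samples are generated for illustration.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
symmetric = rng.normal(size=2000)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

threshold = 1.0  # same default as 'skewness_threshold' in the method above
decisions = {}
for name, values in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    s = float(skew(values))
    decisions[name] = "transform" if abs(s) > threshold else "keep"
    print(f"{name}: skew={s:.2f} -> {decisions[name]}")
```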

### 5. Handle Outliers

**Goal**: Reduce the influence of extreme values that can skew model performance.

**Modification for Clustering**:

- **KMeans & KPrototypes**: Sensitive to outliers as they rely on distance metrics.
- **KModes**: Less sensitive to numerical outliers since it operates on categorical data, but still may need handling if mixed.

**Implementation Example**:

```python
    def handle_outliers(self, X_train, y_train=None):
        numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
        
        if self.model_type in ['kmeans', 'kprototypes'] and numerical_features:
            # Use IsolationForest for outlier detection
            iso_forest = IsolationForest(contamination=0.05, random_state=self.random_state)
            iso_forest.fit(X_train[numerical_features])
            outliers = iso_forest.predict(X_train[numerical_features])
            mask = outliers != -1
            X_train_filtered = X_train[mask]
            if y_train is not None:
                y_train_filtered = y_train[mask]
            else:
                y_train_filtered = y_train
            self.outlier_detector = iso_forest
            return X_train_filtered, y_train_filtered
        
        elif self.model_type == 'kmodes':
            # Typically, no outlier handling needed for purely categorical data
            return X_train, y_train
        
        else:
            # Handle other model types as previously defined
            # ...
            return X_train, y_train
```
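To show what the `IsolationForest` step returns, here is a small standalone sketch on synthetic data with two planted extreme points. `predict` labels inliers as `1` and outliers as `-1`, which is why the method above keeps rows where the prediction is not `-1`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(0, 1, size=(200, 2))
X = np.vstack([X, [[8.0, 8.0], [-9.0, 7.5]]])   # two planted extremes

iso = IsolationForest(contamination=0.05, random_state=42).fit(X)
pred = iso.predict(X)            # 1 = inlier, -1 = outlier
mask = pred != -1
X_filtered = X[mask]

print(f"flagged {int((~mask).sum())} of {len(X)} rows")
```

Note that a fixed `contamination=0.05` flags roughly 5% of the rows whether or not the data contains genuine outliers, so the value is worth tuning per dataset.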

### 6. Choose and Apply Transformations (Based on Normality Tests)

**Goal**: Apply transformations to achieve distributions closer to model assumptions.

**Modification for Clustering**:

- **KMeans & KPrototypes**: Apply transformations like Yeo-Johnson to reduce skewness in numerical features.
- **KModes**: No transformation needed for categorical data.

**Implementation Example**:

```python
    def apply_transformations(self, X_train, X_test, features_to_transform):
        if not features_to_transform:
            return X_train, X_test
        
        # Initialize PowerTransformer
        pt = PowerTransformer(method='yeo-johnson')
        pt.fit(X_train[features_to_transform])
        X_train[features_to_transform] = pt.transform(X_train[features_to_transform])
        X_test[features_to_transform] = pt.transform(X_test[features_to_transform])
        
        # Store the transformer for inverse transformations and prediction data
        self.transformers['power'] = pt
        
        return X_train, X_test
```
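As a quick check of what the Yeo-Johnson step does, the sketch below fits `PowerTransformer` on a right-skewed synthetic training sample, applies it to a separate test sample, and compares skewness before and after. The data is generated for illustration.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_train = rng.lognormal(0.0, 1.0, size=(1000, 1))
X_test = rng.lognormal(0.0, 1.0, size=(200, 1))

pt = PowerTransformer(method="yeo-johnson")   # also standardizes by default
X_train_t = pt.fit_transform(X_train)         # lambda is estimated on train only
X_test_t = pt.transform(X_test)               # and reused unchanged on test

skew_before = float(skew(X_train[:, 0]))
skew_after = float(skew(X_train_t[:, 0]))
print(f"skew before={skew_before:.2f}, after={skew_after:.2f}")
```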

### 7. Encode Categorical Variables

**Goal**: Convert categorical data into numeric form, ensuring that categorical relationships and structures are preserved while making the data suitable for the chosen model.

**Modification for Clustering**:

- **KMeans**: Generally handles only numerical data, so categorical variables need encoding.
- **KModes**: Designed for categorical data; encoding is optional but ensuring consistent data formats is essential.
- **KPrototypes**: Requires specifying categorical feature indices and handles both numerical and categorical data.

**Implementation Example**:

```python
    def encode_categorical_variables(self, X_train, X_test, y_train=None):
        categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # If model is KModes, encoding might not be necessary but ensuring consistency
        if self.model_type == 'kmodes':
            # Ensure categorical data is of type string
            X_train[categorical_features] = X_train[categorical_features].astype(str)
            X_test[categorical_features] = X_test[categorical_features].astype(str)
            return X_train, X_test
        
        # KMeans needs numeric input; KPrototypes consumes categorical columns directly
        if self.model_type in ['kmeans', 'kprototypes']:
            # Determine nominal vs. ordinal if applicable
            # For simplicity, assuming all categorical features are nominal
            # Adjust as needed based on actual data
            ordinal_features = []  # Define if any
            nominal_features = [feat for feat in categorical_features if feat not in ordinal_features]
            
            # KPrototypes handles categorical features internally, and with no
            # categorical columns there is nothing to encode for KMeans either
            if self.model_type == 'kprototypes' or not nominal_features:
                return X_train, X_test
            
            # For KMeans, use OneHotEncoder to avoid implying order
            # (`sparse_output` replaced the `sparse` argument in scikit-learn 1.2)
            onehot_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
            onehot_encoder.fit(X_train[nominal_features])
            X_train_onehot = onehot_encoder.transform(X_train[nominal_features])
            X_test_onehot = onehot_encoder.transform(X_test[nominal_features])
            
            # Get feature names after one-hot encoding
            onehot_feature_names = onehot_encoder.get_feature_names_out(nominal_features)
            
            # Convert to DataFrame
            X_train_onehot_df = pd.DataFrame(X_train_onehot, columns=onehot_feature_names, index=X_train.index)
            X_test_onehot_df = pd.DataFrame(X_test_onehot, columns=onehot_feature_names, index=X_test.index)
            
            # Drop original nominal features and concatenate one-hot encoded features
            X_train = X_train.drop(columns=nominal_features).join(X_train_onehot_df)
            X_test = X_test.drop(columns=nominal_features).join(X_test_onehot_df)
            
            # Store OneHotEncoder for inverse transformations and prediction data
            self.encoders['onehot'] = onehot_encoder
            
            return X_train, X_test
        
        # Handle other encoding strategies as needed
        return X_train, X_test
```

### 8. Apply Scaling (If Needed by Model)

**Goal**: Normalize feature scales so that features contribute appropriately, especially in distance or gradient-based models.

**Modification for Clustering**:

- **KMeans & KPrototypes**: Require scaling of numerical features.
- **KModes**: Typically does not require scaling since it deals with categorical data.

**Implementation Example**:

```python
    def apply_scaling(self, X_train, X_test):
        numerical_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
        
        scaling_strategy = self.model_requirements.get(self.model_type, {}).get('scaling', None)
        
        if scaling_strategy == 'standard':
            scaler = StandardScaler()
        elif scaling_strategy == 'minmax':
            scaler = MinMaxScaler()
        elif scaling_strategy == 'robust':
            scaler = RobustScaler()
        else:
            scaler = None
        
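        if scaler and numerical_features:
            # Cast to float so integer columns can hold scaled values; astype returns
            # new frames, so the caller's data is not modified in place
            float_dtypes = {col: 'float64' for col in numerical_features}
            X_train = X_train.astype(float_dtypes)
            X_test = X_test.astype(float_dtypes)
            
            # Fit on the training split only, then apply to both splits
            scaler.fit(X_train[numerical_features])
            X_train[numerical_features] = scaler.transform(X_train[numerical_features])
            X_test[numerical_features] = scaler.transform(X_test[numerical_features])
            
            # Store scaler for inverse transformations and prediction data
            self.scalers['scaler'] = scaler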
        
        return X_train, X_test
```
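To illustrate why KMeans needs scaled inputs, the sketch below reuses the `age` and `income` values from the example dataset later in this guide and compares each feature's share of the total variance before and after standardization. KMeans minimizes within-cluster squared distances, so a feature that holds almost all the variance effectively decides the clusters on its own. The helper `variance_share` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same values as the sample DataFrame used in the usage example.
age = [25, 30, 22, 35, 28, 40, 50, 23]
income = [50000, 60000, 45000, 80000, 52000, 90000, 120000, 48000]
X = np.column_stack([age, income]).astype(float)

def variance_share(data):
    """Return each column's fraction of the total variance."""
    variances = data.var(axis=0)
    return variances / variances.sum()

raw_share = variance_share(X)
scaled_share = variance_share(StandardScaler().fit_transform(X))

print("raw share (age, income):   ", raw_share)
print("scaled share (age, income):", scaled_share)
```

On the raw data, `income` accounts for nearly all of the variance; after standardization both features contribute equally.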

### 9. Implement SMOTE (Train Only)

**Goal**: Address class imbalance in classification tasks by generating synthetic minority class samples.

**Modification for Clustering**:

- **Clustering Models**: SMOTE is not applicable as clustering is unsupervised. Thus, skip SMOTE for clustering algorithms.

**Implementation Example**:

```python
    def implement_smote(self, X_train, y_train):
        # `model_requirements` spells the key 'logistic_regression'; accept both spellings
        if self.model_type not in ['classification', 'logistic regression', 'logistic_regression']:
            # SMOTE needs class labels, so skip it for clustering and other unsupervised models
            return X_train, y_train
        
        # Get categorical feature indices for SMOTENC
        categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
        categorical_feature_indices = [X_train.columns.get_loc(col) for col in categorical_features]
        
        if categorical_feature_indices:
            # Mixed data: SMOTENC handles categorical columns without interpolating them
            smote = SMOTENC(categorical_features=categorical_feature_indices, random_state=self.random_state)
        else:
            smote = SMOTE(random_state=self.random_state)
        
        X_res, y_res = smote.fit_resample(X_train, y_train)
        self.smote = smote
        return X_res, y_res
```

### 10. Train Model on Preprocessed Training Data

**Goal**: Fit the chosen model to the fully preprocessed, balanced training data.

**Modification for Clustering**:

- **Clustering Models**: Fit the appropriate clustering algorithm based on the dataset's feature types.

**Implementation Example**:

```python
    def train_model(self, X_train, y_train=None):
        if self.model_type == 'kmeans':
            # Determine optimal k using Elbow Method or Silhouette Score
            optimal_k = self.determine_optimal_k_kmeans(X_train)
            model = KMeans(n_clusters=optimal_k, random_state=self.random_state)
            model.fit(X_train)
            labels = model.labels_
            score = silhouette_score(X_train, labels)
            return model, labels, score
        
        elif self.model_type == 'kmodes':
            optimal_k = self.determine_optimal_k_kmodes(X_train)
            kmodes = KModes(n_clusters=optimal_k, init='Huang', n_init=5, verbose=0)
            labels = kmodes.fit_predict(X_train)
            # For silhouette score, need to encode categorical data
            X_train_encoded = pd.get_dummies(X_train)
            score = silhouette_score(X_train_encoded, labels)
            return kmodes, labels, score
        
        elif self.model_type == 'kprototypes':
            # Determine optimal k using Elbow Method or Silhouette Score
            optimal_k = self.determine_optimal_k_kprototypes(X_train)
            # Identify categorical feature indices
            categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
            categorical_feature_indices = [X_train.columns.get_loc(col) for col in categorical_features]
            kproto = KPrototypes(n_clusters=optimal_k, init='Cao', verbose=0, random_state=self.random_state)
            labels = kproto.fit_predict(X_train, categorical=categorical_feature_indices)
            # For silhouette score, approximate using numerical features
            if categorical_feature_indices:
                numerical_features = [i for i in range(X_train.shape[1]) if i not in categorical_feature_indices]
                score = silhouette_score(X_train.iloc[:, numerical_features], labels)
            else:
                score = silhouette_score(X_train, labels)
            return kproto, labels, score
        
        else:
            # Handle other model types
            # ...
            return None, None, None

    def determine_optimal_k_kmeans(self, X, max_k=10):
        inertia = []
        K = range(1, max_k+1)
        for k in K:
            kmeans = KMeans(n_clusters=k, random_state=self.random_state)
            kmeans.fit(X)
            inertia.append(kmeans.inertia_)
        
        # Plot Elbow
        plt.figure(figsize=(8, 4))
        plt.plot(K, inertia, 'bx-')
        plt.xlabel('Number of clusters (k)')
        plt.ylabel('Inertia')
        plt.title('Elbow Method For Optimal k (KMeans)')
        plt.show()
        
        # Placeholder for optimal k selection
        optimal_k = 3  # This should be determined programmatically or via user input
        return optimal_k

    def determine_optimal_k_kmodes(self, X, max_k=10):
        cost = []
        K = range(1, max_k+1)
        for k in K:
            kmodes = KModes(n_clusters=k, init='Huang', n_init=5, verbose=0)
            kmodes.fit_predict(X)
            cost.append(kmodes.cost_)
        
        # Plot Elbow
        plt.figure(figsize=(8, 4))
        plt.plot(K, cost, 'bx-')
        plt.xlabel('Number of clusters (k)')
        plt.ylabel('Cost')
        plt.title('Elbow Method For Optimal k (KModes)')
        plt.show()
        
        # Placeholder for optimal k selection
        optimal_k = 3  # This should be determined programmatically or via user input
        return optimal_k

    def determine_optimal_k_kprototypes(self, X, max_k=10):
        cost = []
        K = range(1, max_k+1)
        categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()
        categorical_feature_indices = [X.columns.get_loc(col) for col in categorical_features]
        
        for k in K:
            kproto = KPrototypes(n_clusters=k, init='Cao', verbose=0, random_state=self.random_state)
            kproto.fit_predict(X, categorical=categorical_feature_indices)
            cost.append(kproto.cost_)
        
        # Plot Elbow
        plt.figure(figsize=(8, 4))
        plt.plot(K, cost, 'bx-')
        plt.xlabel('Number of clusters (k)')
        plt.ylabel('Cost')
        plt.title('Elbow Method For Optimal k (KPrototypes)')
        plt.show()
        
        # Placeholder for optimal k selection
        optimal_k = 3  # This should be determined programmatically or via user input
        return optimal_k
```
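The `determine_optimal_k_*` methods above return a placeholder `k`. One way to choose `k` programmatically is to pick the value with the highest mean silhouette score. The sketch below is a standalone illustration of that idea for KMeans, not the pipeline's own method: `choose_k_by_silhouette` is a hypothetical helper, and the synthetic blobs stand in for preprocessed training data. The search starts at `k=2` because the silhouette score is undefined for a single cluster.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def choose_k_by_silhouette(X, k_values=range(2, 7), random_state=42):
    """Return (best_k, scores), where best_k maximizes the mean silhouette score."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic data with three well-separated groups (an assumption for the demo).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.8,
    random_state=0,
)

best_k, scores = choose_k_by_silhouette(X)
print("best k:", best_k)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
```

On real data the silhouette curve is often flatter than in this demo, so it is reasonable to inspect the scores alongside the elbow plot rather than rely on the maximum alone.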

### 11. Predict on Test Data (No SMOTE on Test)

**Goal**: Evaluate model performance on the original, untouched test set, ensuring a real-world performance estimate.

**Modification for Clustering**:

- **Clustering Models**: Clustering algorithms are unsupervised and typically do not use a test set in the traditional sense. However, you can assign cluster labels to the test set based on the trained model.

**Implementation Example**:

```python
    def predict_on_test(self, model, X_test):
        if self.model_type == 'kmeans':
            labels = model.predict(X_test)
            score = silhouette_score(X_test, labels)
            return labels, score
        
        elif self.model_type == 'kmodes':
            labels = model.predict(X_test)
            # For silhouette score, encode categorical data
            X_test_encoded = pd.get_dummies(X_test)
            score = silhouette_score(X_test_encoded, labels)
            return labels, score
        
        elif self.model_type == 'kprototypes':
            categorical_features = X_test.select_dtypes(include=['object', 'category']).columns.tolist()
            categorical_feature_indices = [X_test.columns.get_loc(col) for col in categorical_features]
            labels = model.predict(X_test, categorical=categorical_feature_indices)
            # For silhouette score, approximate using numerical features
            if categorical_feature_indices:
                numerical_features = [i for i in range(X_test.shape[1]) if i not in categorical_feature_indices]
                score = silhouette_score(X_test.iloc[:, numerical_features], labels)
            else:
                score = silhouette_score(X_test, labels)
            return labels, score
        
        else:
            # Handle other model types
            # ...
            return None, None
```
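The KModes branches above compute the silhouette score on one-hot columns with the default Euclidean distance. An alternative that is closer to the matching dissimilarity KModes optimizes is to pass `metric="hamming"` to `silhouette_score`, which measures the fraction of attributes on which two rows differ. The sketch below uses made-up categorical data and hand-written labels in place of a fitted KModes model; the integer codes from `OrdinalEncoder` are used only for equality comparison, so no ordering is implied.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical data with two loose groups.
X_cat = pd.DataFrame({
    "color": ["red", "red", "red", "blue", "blue", "blue"],
    "size": ["S", "S", "M", "L", "L", "XL"],
    "member": ["yes", "no", "yes", "no", "yes", "no"],
})

# Stand-in for labels returned by a clustering model such as KModes.
labels = np.array([0, 0, 0, 1, 1, 1])

# Encode categories as numeric codes so the distance function accepts them.
codes = OrdinalEncoder().fit_transform(X_cat)

# Hamming distance = share of columns where two rows have different categories.
score = silhouette_score(codes, labels, metric="hamming")
print(f"Hamming silhouette: {score:.3f}")
```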

### 12. Final Inverse Transformations for Interpretability

**Goal**: Revert preprocessed data (scaled, encoded, transformed) back to its original form for interpretability and reporting.

**Modification for Clustering**:

- **KMeans & KPrototypes**: Inverse transform numerical features if scaled.
- **KModes**: May require reversing encoding if applied.

**Implementation Example**:

```python
    def inverse_transformations(self, X_transformed):
        X_original = X_transformed.copy()
        
        # Inverse scaling, using the columns the scaler was fitted on
        # (`feature_names_in_` is available when the scaler was fitted on a DataFrame)
        if 'scaler' in self.scalers:
            scaler = self.scalers['scaler']
            numerical_features = scaler.feature_names_in_.tolist()
            X_original[numerical_features] = scaler.inverse_transform(
                X_transformed[numerical_features]
            )
        
        # Inverse encoding: decode from the unscaled copy, then drop the dummy columns
        if 'onehot' in self.encoders:
            onehot_encoder = self.encoders['onehot']
            nominal_features = onehot_encoder.feature_names_in_.tolist()
            onehot_feature_names = onehot_encoder.get_feature_names_out().tolist()
            decoded = onehot_encoder.inverse_transform(X_original[onehot_feature_names])
            X_original = X_original.drop(columns=onehot_feature_names)
            X_original[nominal_features] = decoded
        
        # Inverse transformations (e.g., PowerTransformer) on the columns the transformer
        # was fitted on, rather than re-running the normality test on reversed data
        if 'power' in self.transformers:
            pt = self.transformers['power']
            features_to_transform = pt.feature_names_in_.tolist()
            X_original[features_to_transform] = pt.inverse_transform(X_original[features_to_transform])
        
        return X_original
```

### 13. Final Inverse Transformation Validation

**Goal**: Validate that the inverse transformations restore the data to its near-original form, ensuring interpretability is accurate.

**Modification for Clustering**:

- **Clustering Models**: Mainly applicable for numerical features; categorical features should match exactly if encoded correctly.

**Implementation Example**:

```python
    def validate_inverse_transformations(self, X_original, X_reversed):
        numerical_features = X_original.select_dtypes(include=['int64', 'float64']).columns.tolist()
        categorical_features = X_original.select_dtypes(include=['object', 'category']).columns.tolist()
        
        # Numerical Features Validation
        if numerical_features:
            diff = np.abs(X_original[numerical_features] - X_reversed[numerical_features])
            mae = diff.mean().mean()
            print("Mean Absolute Error (MAE) on numerical features:", mae)
        
        # Categorical Features Validation
        if categorical_features:
            categorical_match = (
                X_original[categorical_features].astype(str) == 
                X_reversed[categorical_features].astype(str)
            )
            if categorical_match.all().all():
                print("Categorical features match after inverse transformation.")
            else:
                mismatches = categorical_match.apply(lambda x: not x.all(), axis=1)
                print(f"Found {mismatches.sum()} mismatched samples in categorical features.")
        
        # Visualization (Optional)
        for col in numerical_features:
            plt.figure(figsize=(10, 4))
            # `fill` replaces the deprecated `shade` argument in recent seaborn releases
            sns.kdeplot(X_original[col], label='Original', fill=True)
            sns.kdeplot(X_reversed[col], label='Inverse Transformed', fill=True)
            plt.title(f'Distribution Comparison for {col}')
            plt.legend()
            plt.show()
        
        for col in categorical_features:
            original_counts = X_original[col].value_counts()
            inverse_counts = X_reversed[col].value_counts()
            comparison_df = pd.DataFrame({
                'Original': original_counts,
                'Inverse Transformed': inverse_counts
            }).fillna(0)
            comparison_df.plot(kind='bar', figsize=(10, 6))
            plt.title(f'Category Counts Comparison for {col}')
            plt.show()
        
        # Statistical Tests (Optional)
        from scipy.stats import ttest_ind
        for col in numerical_features:
            stat, p_val = ttest_ind(X_original[col].dropna(), X_reversed[col].dropna())
            print(f"T-Test for {col}: stat={stat}, p-value={p_val}")
```
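For an automated pipeline, a pass/fail round-trip check with an explicit tolerance can be easier to act on than plots or a t-test. The following is a minimal standalone sketch, assuming a `StandardScaler` for the numerical column and a `OneHotEncoder` for the categorical one; the column names and the tolerance `atol=1e-8` are illustrative choices, not values taken from the pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler

original = pd.DataFrame({
    "age": [25.0, 30.0, 22.0, 35.0],
    "gender": ["M", "F", "F", "M"],
})

# Forward transformations, fitted on the original data.
scaler = StandardScaler().fit(original[["age"]])
encoder = OneHotEncoder().fit(original[["gender"]])
scaled = scaler.transform(original[["age"]])
encoded = encoder.transform(original[["gender"]]).toarray()

# Inverse transformations back to the original representation.
restored = pd.DataFrame({
    "age": scaler.inverse_transform(scaled).ravel(),
    "gender": encoder.inverse_transform(encoded).ravel(),
})

# Numerical columns should match within floating-point tolerance;
# categorical columns should match exactly.
numeric_ok = bool(np.allclose(original["age"], restored["age"], atol=1e-8))
categorical_ok = bool((original["gender"] == restored["gender"]).all())
print("numeric round trip ok:    ", numeric_ok)
print("categorical round trip ok:", categorical_ok)
```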

### 14. Complete Automated Preprocessing Pipeline

**Goal**: Integrate all steps into a cohesive automated pipeline that conditionally applies clustering-specific preprocessing.

**Implementation Example**:

```python
    def automated_preprocessing_pipeline(self, df, y=None):
        # Step 1: Initialize Preprocessor (already done in __init__)
        
        # Step 2: Split Dataset
        X_train, X_test, y_train, y_test = self.split_dataset(df, y)
        
        # Step 3: Handle Missing Values
        X_train, X_test = self.handle_missing_values(X_train, X_test, y_train)
        
        # Step 4: Test for Normality
        features_to_transform = self.test_normality(X_train)
        
        # Step 5: Handle Outliers
        X_train, y_train = self.handle_outliers(X_train, y_train)
        
        # Step 6: Choose and Apply Transformations
        X_train, X_test = self.apply_transformations(X_train, X_test, features_to_transform)
        
        # Step 7: Encode Categorical Variables
        X_train, X_test = self.encode_categorical_variables(X_train, X_test, y_train)
        
        # Step 8: Apply Scaling
        X_train, X_test = self.apply_scaling(X_train, X_test)
        
        # Step 9: Implement SMOTE (Train Only; skipped for clustering)
        if self.model_type in ['classification', 'logistic regression', 'logistic_regression']:
            X_train, y_train = self.implement_smote(X_train, y_train)
        
        # Step 10: Train Model on Preprocessed Training Data
        model, labels, score = self.train_model(X_train, y_train)
        
        # Step 11: Predict on Test Data (clustering models only)
        # Initialize the test outputs so the return statement works for every model type
        labels_test, score_test = None, None
        if self.model_type in ['kmeans', 'kmodes', 'kprototypes']:
            print(f"Clustering Silhouette Score: {score}")
            labels_test, score_test = self.predict_on_test(model, X_test)
            print(f"Test Silhouette Score: {score_test}")
        
        # Step 12: Final Inverse Transformations (Optional for Clustering)
        # Inverse transformations are not typically needed for clustering unless for interpretability
        # Uncomment if needed
        # X_test_reversed = self.inverse_transformations(X_test)
        
        # Step 13: Final Inverse Transformation Validation
        # Uncomment if inverse transformations were applied
        # self.validate_inverse_transformations(X_test_original, X_test_reversed)
        
        return model, labels, X_test, labels_test, score, score_test
```

### 15. Example Usage

**Implementation Example**:

```python
# Sample DataFrame for Clustering (Mixed Data)
data = {
    'age': [25, 30, 22, 35, 28, 40, 50, 23],
    'income': [50000, 60000, 45000, 80000, 52000, 90000, 120000, 48000],
    'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'M', 'F'],
    'marital_status': ['Single', 'Married', 'Single', 'Married', 'Single', 'Married', 'Married', 'Single']
}
df = pd.DataFrame(data)

# Initialize Preprocessor for KPrototypes (Mixed Data)
preprocessor = DataPreprocessor(model_type='kprototypes')

# Run the automated pipeline
model, labels_train, X_test, labels_test, score_train, score_test = preprocessor.automated_preprocessing_pipeline(df)

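# labels_train covers only the training rows (after the split and outlier handling),
# so it cannot be assigned to df directly; attach the test labels to the test split instead
X_test_clustered = X_test.assign(Cluster=labels_test)
print(X_test_clustered)
```

**Output** (illustrative: the table shows the form a labeling of all eight rows would take; actual assignments depend on the split and the chosen `k`):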

```
   age   income gender marital_status  Cluster
0   25    50000      M         Single        0
1   30    60000      F        Married        1
2   22    45000      F         Single        0
3   35    80000      M        Married        1
4   28    52000      F         Single        0
5   40    90000      M        Married        1
6   50   120000      M        Married        1
7   23    48000      F         Single        0
```

### 16. Ensuring Separation for Other Model Types

To ensure that the clustering-specific preprocessing does not interfere with other model types, you can maintain separate configurations or extend the `DataPreprocessor` class to handle different scenarios. Here's how you can achieve this:

**Modification Example**:

```python
# Extend the model_requirements to include other model types
    def _define_model_requirements(self):
        # Define preprocessing recommendations based on model type
        requirements = {
            'kmeans': {
                'feature_types': 'numerical',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': 'standard',
                'encoding': None,
                'outlier_handling': 'isolationforest',
                'transformation': 'yeo-johnson'
            },
            'kmodes': {
                'feature_types': 'categorical',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': None,
                'encoding': 'ordinal_nominal',
                'outlier_handling': None,
                'transformation': None
            },
            'kprototypes': {
                'feature_types': 'mixed',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': 'standard',
                'encoding': 'ordinal_nominal',
                'outlier_handling': 'isolationforest',
                'transformation': 'yeo-johnson'
            },
            # Define requirements for other model types
            'linear_regression': {
                'feature_types': 'numerical',
                'imputation': {'numerical': 'mean', 'categorical': 'mode'},
                'scaling': 'standard',
                'encoding': 'onehot_nominal',
                'outlier_handling': 'zscore_iqr',
                'transformation': 'yeo-johnson'
            },
            'logistic_regression': {
                'feature_types': 'mixed',
                'imputation': {'numerical': 'mean', 'categorical': 'mode'},
                'scaling': 'standard',
                'encoding': 'onehot_nominal',
                'outlier_handling': 'zscore_iqr',
                'transformation': 'yeo-johnson'
            },
            'svm': {
                'feature_types': 'numerical',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': 'minmax',
                'encoding': 'onehot_nominal',
                'outlier_handling': 'iqr_winsorize',
                'transformation': 'yeo-johnson'
            },
            'knn': {
                'feature_types': 'numerical',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': 'minmax',
                'encoding': 'onehot_nominal',
                'outlier_handling': 'iqr_winsorize',
                'transformation': 'yeo-johnson'
            },
            'random_forest': {
                'feature_types': 'mixed',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': None,
                'encoding': 'ordinal_nominal',
                'outlier_handling': None,
                'transformation': None
            },
            'neural_networks': {
                'feature_types': 'numerical',
                'imputation': {'numerical': 'median', 'categorical': 'mode'},
                'scaling': 'minmax',
                'encoding': 'onehot_nominal',
                'outlier_handling': 'winsorize_clipping',
                'transformation': 'yeo-johnson'
            },
            # Add more model types as needed
        }
        return requirements
```

**Usage for Different Model Types**:

When initializing the `DataPreprocessor` for a different model type, the preprocessing steps will automatically follow the specified configurations without affecting clustering preprocessing.

```python
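# `y` is the target Series for the supervised examples; it is not defined in the clustering example above.
# `train_model` as shown implements only the clustering branches, so the returned model,
# labels, and scores are None until a branch for the model type is added.

# Example for Linear Regression
preprocessor_lr = DataPreprocessor(model_type='linear_regression')
model_lr, labels_lr, X_test_lr, labels_test_lr, score_lr, score_test_lr = preprocessor_lr.automated_preprocessing_pipeline(df, y)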

# Example for Random Forest
preprocessor_rf = DataPreprocessor(model_type='random_forest')
model_rf, labels_rf, X_test_rf, labels_test_rf, score_rf, score_test_rf = preprocessor_rf.automated_preprocessing_pipeline(df, y)
```

### 17. Additional Considerations

- **Optimal K Selection**: Automate the selection of the optimal number of clusters (`k`) using methods like the **Kneedle algorithm** or silhouette analysis.
  
  **Implementation Example**:

  ```python
    from kneed import KneeLocator

    def determine_optimal_k_kmeans(self, X, max_k=10):
        inertia = []
        K = range(1, max_k+1)
        for k in K:
            kmeans = KMeans(n_clusters=k, random_state=self.random_state)
            kmeans.fit(X)
            inertia.append(kmeans.inertia_)
        
        # Use KneeLocator to find the elbow
        kl = KneeLocator(K, inertia, curve='convex', direction='decreasing')
        optimal_k = kl.elbow
        if optimal_k is None:
            optimal_k = 3  # Fallback
        
        # Plot for visualization
        plt.figure(figsize=(8, 4))
        plt.plot(K, inertia, 'bx-')
        plt.vlines(optimal_k, plt.ylim()[0], plt.ylim()[1], linestyles='dashed')
        plt.xlabel('Number of clusters (k)')
        plt.ylabel('Inertia')
        plt.title('Elbow Method For Optimal k (KMeans)')
        plt.show()
        
        return optimal_k
  ```

- **Dimensionality Reduction**: Apply PCA or other techniques for high-dimensional data to improve clustering performance.

  **Implementation Example**:

  ```python
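    def apply_dimensionality_reduction(self, X_train, X_test, n_components=10):
        # PCA cannot return more components than min(n_samples, n_features)
        n_components = min(n_components, X_train.shape[0], X_train.shape[1])
        pca = PCA(n_components=n_components, random_state=self.random_state)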
        pca.fit(X_train)
        X_train_reduced = pca.transform(X_train)
        X_test_reduced = pca.transform(X_test)
        self.transformers['pca'] = pca
        return X_train_reduced, X_test_reduced
  ```

- **Persistence of Preprocessors**: Save and load transformers and encoders to ensure consistency during prediction.

  **Implementation Example**:

  ```python
    def save_preprocessors(self, filepath='preprocessors.pkl'):
        with open(filepath, 'wb') as f:
            pickle.dump({
                'imputers': self.imputers,
                'transformers': self.transformers,
                'encoders': self.encoders,
                'scalers': self.scalers,
                'smote': self.smote
            }, f)
    
    def load_preprocessors(self, filepath='preprocessors.pkl'):
        with open(filepath, 'rb') as f:
            data = pickle.load(f)
            self.imputers = data.get('imputers', {})
            self.transformers = data.get('transformers', {})
            self.encoders = data.get('encoders', {})
            self.scalers = data.get('scalers', {})
            self.smote = data.get('smote', None)
  ```

- **Handling High Cardinality Features**: For clustering, especially with **KModes**, high cardinality categorical features can lead to computational inefficiency. Consider feature reduction techniques or encoding strategies that minimize dimensionality.
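One simple way to reduce cardinality before clustering is to merge infrequent categories into a single bucket. The sketch below shows this with pandas only; `collapse_rare_categories`, the `min_count` threshold, and the `"Other"` label are hypothetical choices for illustration, and the right threshold depends on the dataset.

```python
import pandas as pd

def collapse_rare_categories(series, min_count=2, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace the rest with the shared label.
    return series.where(~series.isin(rare), other_label)

# Hypothetical column: 'Oslo' and 'Quito' each appear only once.
cities = pd.Series(["Paris", "Paris", "Lima", "Oslo", "Paris", "Lima", "Quito"])
collapsed = collapse_rare_categories(cities)

print(collapsed.tolist())
print("unique categories before:", cities.nunique(), "after:", collapsed.nunique())
```

The set of rare categories should be determined on the training split and then reused on the test split, in the same way as the fitted encoders and scalers.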

### 18. Summary with Defined Paths and Options

1. **Initialize Preprocessor and Configure Options**
   - **Default**: Set up with model type, split preferences, and preprocessing options.
   - **Options**: Adjust split ratios, enable cross-validation, or customize preprocessing settings.

2. **Split Dataset into Train/Test and X/y**
   - **Default**: Stratified for classification, random for regression, chronological for time series, random/domain-specific for clustering.
   - **Options**: Adjust test size, use cross-validation, or implement custom splits.

3. **Handle Missing Values**
   - **Default**: Mean/Mode or Median/Mode imputation based on model type.
   - **Options**: Switch to median, use KNNImputer, or iterative imputation as needed.

4. **Test for Normality**
   - **Default**: Use p-values + skewness for linear/logistic regression; use skewness alone for others.
   - **Options**: Incorporate additional normality tests, adjust thresholds, or combine with visualization.

5. **Handle Outliers**
   - **Default**: Z-Score + IQR filtering for linear/logistic regression; IQR + Winsorization for SVM/k-NN; IsolationForest for clustering.
   - **Options**: Switch to different outlier detection methods or alternative techniques like RobustScaler.

6. **Choose and Apply Transformations**
   - **Default**: Yeo-Johnson for linear/logistic regression and clustering if skewness criteria met.
   - **Options**: Use log transform, skip transformations for tree-based, or adapt to time series needs.

7. **Encode Categorical Variables**
   - **Default**: OneHotEncoder for nominal features in KMeans; KPrototypes takes categorical columns as-is; ensure consistent category values for KModes.
   - **Options**: Use OrdinalEncoder for specific cases, target encoding for high cardinality.

8. **Apply Scaling (If Needed by Model)**
   - **Default**: StandardScaler for KMeans/KPrototypes; no scaling for KModes.
   - **Options**: Use MinMaxScaler, RobustScaler, or skip scaling as appropriate.

9. **Implement SMOTE (Train Only)**
   - **Default**: Skip for clustering models.
   - **Options**: Not applicable to clustering; retain original data.

10. **Train Model on Preprocessed Training Data**
    - **Default**: Fit the appropriate clustering algorithm.
    - **Options**: Automate optimal `k` selection.

11. **Predict on Test Data (No SMOTE on Test)**
    - **Default**: Assign clusters to test data using the trained model.
    - **Options**: Evaluate with silhouette scores.

12. **Final Inverse Transformations for Interpretability**
    - **Default**: Inverse scale and encode if necessary.
    - **Options**: Apply only if interpretability is required.

13. **Final Inverse Transformation Validation**
    - **Default**: Compute MAE for numerical features; verify categorical matches.
    - **Options**: Adjust tolerance, perform visual checks.

### 19. Ensuring Clustering-Specific Preprocessing Does Not Affect Other Models

To prevent clustering-specific preprocessing from interfering with other model types:

- **Model-Specific Configurations**: Clearly define preprocessing steps within the `model_requirements` based on the model type.
- **Conditional Logic**: Apply preprocessing steps conditionally based on the current model type.
- **Isolation**: Encapsulate clustering-specific transformations and handling within their respective conditional blocks.

**Implementation Example**:

The `DataPreprocessor` class already incorporates model-specific requirements through the `model_requirements` dictionary. Each preprocessing method references these requirements to apply the correct transformations.

For example, the `handle_missing_values` method uses imputation strategies defined in `model_requirements`, ensuring that clustering models apply their specific imputation rules without affecting other models.

Similarly, the `encode_categorical_variables` and `apply_scaling` methods conditionally apply encoding and scaling based on the model type, ensuring that clustering models receive the appropriate transformations.
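The configuration-driven lookup described above can be sketched in a few lines. This is a simplified standalone illustration rather than the `DataPreprocessor` implementation: `MODEL_REQUIREMENTS` holds only the `scaling` entries for three model types, and `make_scaler` is a hypothetical helper.

```python
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Reduced version of the requirements dictionary: scaling entries only.
MODEL_REQUIREMENTS = {
    "kmeans": {"scaling": "standard"},
    "kmodes": {"scaling": None},
    "svm": {"scaling": "minmax"},
}

# Map strategy names to scaler classes.
SCALERS = {
    "standard": StandardScaler,
    "minmax": MinMaxScaler,
    "robust": RobustScaler,
}

def make_scaler(model_type):
    """Return a new scaler for the model type, or None if no scaling applies."""
    strategy = MODEL_REQUIREMENTS.get(model_type, {}).get("scaling")
    scaler_cls = SCALERS.get(strategy)
    return scaler_cls() if scaler_cls is not None else None

for name in ["kmeans", "kmodes", "svm", "unknown_model"]:
    print(name, "->", make_scaler(name))
```

Because each model type reads only its own entry, changing the clustering configuration leaves the settings for other model types untouched.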

### 20. Conclusion

By integrating clustering-specific preprocessing rules into your automated preprocessing pipeline, you ensure that each clustering algorithm (**KMeans**, **KModes**, **KPrototypes**) is appropriately handled based on the dataset's feature types. This approach maintains the integrity of preprocessing steps for other model types, providing a flexible and robust framework for various machine learning tasks.

**Key Takeaways**:

- **Modular Design**: Keep preprocessing steps modular and conditionally applied based on model type.
- **Configuration-Driven**: Utilize a configuration-driven approach (`model_requirements`) to define preprocessing rules for each model.
- **Encapsulation**: Encapsulate clustering-specific logic within dedicated conditional blocks to prevent interference with other models.
- **Persistence**: Save and load preprocessing transformers to maintain consistency across training and prediction phases.
- **Automation**: Automate repetitive tasks like optimal `k` selection while allowing for manual adjustments when necessary.

By following this structured approach, you can effectively automate the preprocessing for different clustering methods, ensuring accurate and efficient model training and evaluation.