[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://drive.google.com/file/d/1i-KRyXrMJtAuS7K0uSMNxcgyXWs3t6h7/view?usp=sharing)

# **Imputation Techniques for Missing Data Handling**

# **`1.` Introduction**

This notebook handles missing data in a dataset using a range of imputation techniques. Both standard imputers and group-based strategies are used to fill in missing values. The main objectives are:

- **Identifying Missing Values:** Understanding the scope and nature of missing data in both numerical and categorical features.
- **Choosing Imputation Methods:** Applying different imputation techniques, including group-based mean/mode strategies and specific imputer objects.
- **Combining Techniques:** Utilizing a flexible approach to handle missing values in a way that maximizes data quality for downstream analysis.
- **Evaluation of Imputation Impact:** Comparing the effects of different imputation strategies on model performance and overall data integrity.
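
As a minimal sketch of the first objective, the snippet below summarizes missing values per column. The toy DataFrame, its column names, and the helper `missing_summary` are hypothetical and only serve as an illustration; they are not part of this notebook's dataset or code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "income": [3000.0, 4200.0, np.nan, 5100.0, 3900.0],
    "city": ["north", None, "south", "north", "south"],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, count, and percentage of missing values per column."""
    summary = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "pct_missing": frame.isna().mean() * 100,
    })
    # Keep only the columns that actually contain missing values.
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


summary = missing_summary(df)
print(summary)
```

Splitting the summary by dtype afterwards gives the lists of numerical and categorical features that need imputation.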

**Data Preprocessing**

Before testing each combination of imputers on the model, several preprocessing steps were implemented to ensure the dataset was properly prepared for analysis.
1. **Encode Categorical Features:** Categorical variables were encoded either by cardinality (using `classify_cardinality`: dummy encoding for low-cardinality features, label encoding for medium- and high-cardinality ones) or by dummy encoding all of them, as appropriate.
2. **Replace Missing Values (Imputation):** Various imputation techniques were applied to fill in the missing values in both numerical and categorical features.
3. **Scale Data:** Numerical features were scaled using one of three options (Standard, MinMax, or Robust scaling) so that they are appropriately represented for model training.

The dataset under analysis contains both numerical and categorical features with missing values, and the goal is to make these features suitable for machine learning models by ensuring no missing data remains. Group-level aggregation is also utilized for imputation, where features are grouped based on categorical and numerical columns to apply targeted filling strategies.
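
The group-level idea can be sketched as follows. The rule format (a list of dictionaries with `group_cols` and `features` keys) mirrors how `group_imputation` reads its `group_cols_num` and `group_cols_cat` arguments; the data, the column names, and the helpers `fill_group_mean` and `fill_group_mode` are hypothetical and shown only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'region' is an illustrative grouping column.
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "price": [10.0, np.nan, 14.0, 30.0, 34.0, np.nan],
    "segment": ["a", "a", None, "b", None, "b"],
})

# One entry per rule: fill 'features' using statistics computed within 'group_cols'.
group_cols_num = [{"group_cols": ["region"], "features": ["price"]}]
group_cols_cat = [{"group_cols": ["region"], "features": ["segment"]}]


def fill_group_mean(frame, rules):
    """Fill numerical gaps with the mean of the row's group."""
    out = frame.copy()
    for rule in rules:
        for feature in rule["features"]:
            group_mean = out.groupby(rule["group_cols"])[feature].transform("mean")
            # Groups without any observed value keep NaN.
            out[feature] = out[feature].fillna(group_mean)
    return out


def fill_group_mode(frame, rules, fallback="unknown"):
    """Fill categorical gaps with the most frequent value of the row's group."""
    out = frame.copy()
    for rule in rules:
        for feature in rule["features"]:
            def mode_or_fallback(s):
                modes = s.mode()  # mode() ignores missing values
                return modes.iloc[0] if not modes.empty else fallback

            group_mode = out.groupby(rule["group_cols"])[feature].transform(mode_or_fallback)
            out[feature] = out[feature].fillna(group_mode)
    return out


filled = fill_group_mode(fill_group_mean(df, group_cols_num), group_cols_cat)
print(filled)
```

Because each gap is filled from rows in the same group, the imputed values follow the group structure instead of being pulled toward a single global mean or mode.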

# **`2.` Imputation Techniques Implementation**

In [None]:
# Import Libraries
!pip install dask[dataframe] --quiet

import pandas as pd
import numpy as np
import re
from itertools import product

!pip install matplotlib --upgrade --quiet
from matplotlib import rcParams

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
!pip install catboost --quiet
from catboost import CatBoostRegressor
from sklearn.ensemble import (GradientBoostingRegressor, ExtraTreesRegressor,
                              RandomForestRegressor,
                              AdaBoostRegressor)
from sklearn.neural_network import MLPRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor

from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up
seed = 42  # Random seed for reproducibility
np.random.seed(seed)

# Set Matplotlib defaults
plt.style.use('fivethirtyeight')
rcParams['figure.figsize'] = [5,3]
rcParams.update({'font.size': 8})

# Define Imputation Strategies
# Global variables
simple_cat = SimpleImputer(strategy='most_frequent')
constant_cat = SimpleImputer(strategy='constant', fill_value='unknown')
simple_mean_imputer = SimpleImputer(strategy='mean')
simple_median_imputer = SimpleImputer(strategy='median')
iterative_num_imputer = IterativeImputer(random_state=seed)

NaN_models = ['CatBoost']

categorical_imputers = {
    'Simple Categorical': simple_cat,
    'Constant Categorical': constant_cat,
    'Group Mode': None,
}

# Encapsulate the functionality for data imputation and model evaluation
class DataImputer:
    def __init__(self, n_neighbors=5):
        # Initialize imputer settings and model constructors
        self.categorical_imputers = categorical_imputers
        self.numerical_imputers = {
            'Simple Mean': simple_mean_imputer,
            'Simple Median': simple_median_imputer,
            'KNN': KNNImputer(n_neighbors=n_neighbors),
            'No Num Imputation': None,
            'Group Mean': None,
            'Iterative': iterative_num_imputer
            }

        self.NaN_models = NaN_models

        # Model names and corresponding constructor functions
        self.model_constructors = {
            'CatBoost': lambda **kwargs: CatBoostRegressor(cat_features=kwargs.get('cat_features', None), random_state=seed, verbose=0),
            'LGBM': lambda **kwargs: LGBMRegressor(random_state=seed, force_col_wise=True, verbose=-1, **kwargs),
            'GradientBoosting': lambda **kwargs: GradientBoostingRegressor(random_state=seed, **kwargs),
            'ExtraTrees': lambda **kwargs: ExtraTreesRegressor(random_state=seed, **kwargs),
            'XGB': lambda **kwargs: XGBRegressor(random_state=seed, **kwargs),
            'AdaBoost': lambda **kwargs: AdaBoostRegressor(DecisionTreeRegressor(**kwargs), random_state=seed),  # Base estimator: DecisionTree
            'DecisionTree': lambda **kwargs: DecisionTreeRegressor(random_state=seed, **kwargs),
            'RandomForest': lambda **kwargs: RandomForestRegressor(random_state=seed, **kwargs),
            'MLPRegressor': lambda **kwargs: MLPRegressor(random_state=seed, **kwargs),
            'Ridge': lambda **kwargs: Ridge(random_state=seed, **kwargs),
            'Lasso': lambda **kwargs: Lasso(random_state=seed, **kwargs),
        }

    @staticmethod
    def visualize_imputation_scores(rmses, stds, labels, model_name):
        """
        Plot RMSE scores for different imputation techniques.
        """

        fig, ax = plt.subplots(figsize=(6, 8))

        colors = plt.cm.Paired(np.linspace(0, 1, len(rmses)))
        y_positions = np.arange(len(rmses))

        ax.barh(
            y_positions,
            rmses,
            xerr=stds,
            color=colors,
            alpha=0.6,
            align="center"
        )

        ax.set_title(f"{model_name} Model Scores with Respect to Imputation Techniques")
        ax.set_xlabel("RMSE")
        ax.set_yticks(y_positions)
        ax.set_yticklabels(labels)
        ax.invert_yaxis()  # First entry on top (lowest RMSE when inputs are sorted ascending)

        ax.set_xlim(left=np.min(rmses) * 0.9, right=np.max(rmses) * 1.1)
        plt.show()

    @staticmethod
    def cat_num_features(df):
        """
        Identify categorical and numerical features in the DataFrame.
        """

        cat_features = df.select_dtypes(include=['object', 'category']).columns.tolist()
        num_features = df.select_dtypes(exclude=['object', 'category']).columns.tolist()
        return cat_features, num_features

    @staticmethod
    def classify_cardinality(X, low_thrs=20, high_thrs=50):
        """
        Classify categorical features based on their cardinality into low, medium, and high.
        """

        low_card, med_card, high_card = [], [], []
        cat_features, num_features = DataImputer.cat_num_features(X)

        for col in cat_features:
            unique_count = X[col].nunique()

            if unique_count < low_thrs:
                low_card.append(col)
            elif low_thrs <= unique_count <= high_thrs:
                med_card.append(col)
            else:
                high_card.append(col)

        return low_card, med_card, high_card

    @staticmethod
    def encode_data_cardinality(X, X_test=None, drop_first=False, apply_cardinality=True, low_thrs=20, high_thrs=50):
        """
        Encode data based on cardinality, using dummy encoding for low cardinality features and
        label encoding for medium and high cardinality features.
        """
        # Check if X is a pandas DataFrame
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame.")

        X_encoded = X.copy()
        X_test_encoded = X_test.copy() if X_test is not None else None

        if apply_cardinality:
            # classifies features based on their cardinality
            low_card, med_card, high_card = DataImputer.classify_cardinality(X, low_thrs=low_thrs, high_thrs=high_thrs)

            # Dummy encoding for low cardinality features
            X_encoded = pd.get_dummies(X, columns=low_card, drop_first=drop_first)
            X_encoded = X_encoded.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))

            if X_test_encoded is not None:
                X_test_encoded = pd.get_dummies(X_test_encoded, columns=low_card, drop_first=drop_first)
                X_test_encoded = X_test_encoded.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))

                # Align test data with train data
                X_encoded, X_test_encoded = X_encoded.align(X_test_encoded, join='left', axis=1, fill_value=0)

            # Label encoding for medium and high cardinality features
            med_high = med_card + high_card
            if len(med_high) > 0:
                for col in med_high:
                    le = LabelEncoder()
                    X_encoded[col] = le.fit_transform(X_encoded[col])

                    # Encode test data if available
                    if X_test_encoded is not None:
                        try:
                            X_test_encoded[col] = le.transform(X_test_encoded[col])
                        except ValueError:
                            # Categories unseen in the training data are mapped to -1
                            X_test_encoded[col] = X_test_encoded[col].apply(lambda x: le.transform([x])[0] if x in le.classes_ else -1)
        else:
            # Dummy encoding for all categorical features
            X_encoded = pd.get_dummies(X, drop_first=drop_first)
            X_encoded = X_encoded.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))

            # Encode test data if available
            if X_test_encoded is not None:
                X_test_encoded = pd.get_dummies(X_test_encoded, drop_first=drop_first)
                X_test_encoded = X_test_encoded.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))

                # Align test columns with train columns
                X_encoded, X_test_encoded = X_encoded.align(X_test_encoded, join='left', axis=1, fill_value=0)

        if X_test_encoded is not None:
            return X_encoded, X_test_encoded
        else:
            return X_encoded

    @staticmethod
    def scale_data(X_train, X_test=None, scale_type='Standard'):
        """
        Scale data using StandardScaler, MinMaxScaler, or RobustScaler.
        """

        if scale_type == 'Standard':
            scaler = StandardScaler()
        elif scale_type == 'MinMax':
            scaler = MinMaxScaler()
        elif scale_type == 'Robust':
            scaler = RobustScaler()
        else:
            raise ValueError(f"'{scale_type}' is not a valid scaling type.")

        if not isinstance(X_train, pd.DataFrame):
            raise TypeError("X_train must be a pandas DataFrame.")

        X_train_scaled = scaler.fit_transform(X_train)
        X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)

        if X_test is not None:
            if not isinstance(X_test, pd.DataFrame):
                raise TypeError("X_test must be a pandas DataFrame.")

            X_test_scaled = scaler.transform(X_test)
            X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

            return X_train_scaled, X_test_scaled

        return X_train_scaled

    @staticmethod
    def rmse_score(y_true, y_pred):
        """
        Calculate the Root Mean Squared Error (RMSE).
        """
        if len(y_true) != len(y_pred):
            raise ValueError("Input arrays must have the same length.")
        return np.sqrt(mean_squared_error(y_true, y_pred))

    @staticmethod
    def cross_val_score_func(model, X, y, cv=3):
        """
        Perform cross-validation and return RMSE scores.
        """

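        # Wrap rmse_score directly: the `squared` argument of mean_squared_error
        # has been deprecated since scikit-learn 1.4.
        rmse_scorer = make_scorer(DataImputer.rmse_score)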
        scores = cross_val_score(model, X, y, scoring=rmse_scorer, cv=cv, n_jobs=-1)
        return scores

    def get_regressor(self, model_name, cat_features=None, params=None):
        """
        Retrieve a regression model based on the model name.
        """

        # Check if model name is valid
        if model_name not in self.model_constructors:
            raise ValueError(f"'{model_name}' is not a valid model.")

        # Check cat_features in CatBoost model
        if model_name == 'CatBoost':
            if cat_features is None:
                raise ValueError("CatBoostRegressor requires 'cat_features' to be specified.")
            else:
                params = params or {}  # Create an empty dict if params is None
                params['cat_features'] = cat_features  # Add cat_features to params

        # Return the selected model
        return self.model_constructors[model_name](**(params if params is not None else {}))

    @staticmethod
    def group_imputation(df_org, group_cols_cat, group_cols_num, strategy='mean'):
        """
        Impute missing values based on group statistics (mean or mode).
        """
        df = df_org.copy()
        if strategy == 'mean':
            for group_col in group_cols_num:
                for feature in group_col['features']:
                    # Fill with the group mean; groups with no observed values stay NaN
                    df[feature] = df.groupby(group_col['group_cols'])[feature].transform(
                        lambda x: x.fillna(x.mean()))

        elif strategy == 'mode':
            for group_col in group_cols_cat:
                for feature in group_col['features']:
                    # Fill with the group mode; groups with no observed values get 'unknown'
                    df[feature] = df.groupby(group_col['group_cols'])[feature].transform(
                        lambda x: x.fillna(x.mode().iloc[0] if not x.mode().empty else 'unknown'))
        else:
            raise ValueError(f"Unsupported imputation strategy: {strategy}")

        return df

    @staticmethod
    def apply_imputation(X, features, imputer, group_cols_cat, group_cols_num, strategy=None, X_test=None):
        """
        Apply imputation to the specified features.
        """
        if imputer:
            if X_test is not None:
                return imputer.fit_transform(X[features]), imputer.transform(X_test[features])
            else:
                return imputer.fit_transform(X[features]), None

        elif strategy:
            if X_test is not None:
                return DataImputer.group_imputation(X, group_cols_cat, group_cols_num, strategy=strategy), DataImputer.group_imputation(X_test, group_cols_cat, group_cols_num, strategy=strategy)
            else:
                return DataImputer.group_imputation(X, group_cols_cat, group_cols_num, strategy=strategy), None

        return X[features], X_test[features] if X_test is not None else None

    @staticmethod
    def impute(X, cat_name, num_name, cat_imputer, num_imputer, num_features, cat_features, missing_numerical, missing_categorical, group_cols_cat, group_cols_num, apply_cardinality=True, X_test=None):
        """
        Perform imputation for both numerical and categorical features.
        """
        X_num_imputed, X_test_num_imputed = DataImputer.apply_imputation(X,
                                                            num_features,
                                                            num_imputer,
                                                            group_cols_cat,
                                                            group_cols_num,
                                                            strategy='mean' if num_name == 'Group Mean' else None,
                                                            X_test=X_test)
        X_num_imputed = pd.DataFrame(X_num_imputed, columns=num_features)

        X_cat_imputed, X_test_cat_imputed = DataImputer.apply_imputation(
            X,
            cat_features,
            cat_imputer,
            group_cols_cat,
            group_cols_num,
            strategy='mode' if cat_name == 'Group Mode' else None,
            X_test=X_test)


        X_cat_imputed = pd.DataFrame(X_cat_imputed, columns=cat_features)

        X_imputed = pd.concat([X_num_imputed.reset_index(drop=True), X_cat_imputed.reset_index(drop=True)], axis=1)

        if X_test is not None:
            X_test_num_imputed = pd.DataFrame(X_test_num_imputed, columns=num_features)
            X_test_cat_imputed = pd.DataFrame(X_test_cat_imputed, columns=cat_features)
            X_test_imputed = pd.concat([X_test_num_imputed.reset_index(drop=True), X_test_cat_imputed.reset_index(drop=True)], axis=1)

            return X_imputed, X_test_imputed

        return X_imputed, None

    def imputation_model_scores(self, X, y, model_name, missing_categorical, missing_numerical, group_cols_cat, group_cols_num, cv, imputers=None, visualize=False, apply_cardinality=True, X_test=None):
        """
        Evaluate different imputation techniques using a specific model.
        """
        # Check if model name is valid
        if model_name not in self.model_constructors:
            raise ValueError(f"'{model_name}' is not a valid model.")

        # Get categorical and numerical features
        cat_features, num_features = DataImputer.cat_num_features(X)
        # Check for CatBoostRegressor
        if 'CatBoost' in model_name:
            if cat_features is None or len(cat_features) == 0:
                raise ValueError("CatBoostRegressor requires 'cat_features' to be specified and cannot be empty.")
            regressor = self.get_regressor(model_name, cat_features)
        else:
            regressor = self.get_regressor(model_name)

        rmses, stds, x_labels = [], [], []
        # imputation_scores = []
        best_train_imputed = {}
        best_test_imputed = {}

        # Combination of various techniques
        for (num_name, num_imputer), (cat_name, cat_imputer) in product(self.numerical_imputers.items(), self.categorical_imputers.items()):
            if imputers is not None:
                if num_name not in imputers or cat_name not in imputers:
                    continue

            # Group Mean and Group Mode controls
            if num_name == 'Group Mean' and group_cols_num is None:
                print("Warning: 'group_cols_num' is None. Group-based imputation will not be possible.")
                continue

            if cat_name == 'Group Mode' and group_cols_cat is None:
                print("Warning: 'group_cols_cat' is None. Group-based imputation will not be possible.")
                continue

            if model_name not in self.NaN_models:
                if 'Group' in num_name:
                    continue
                if 'No' in num_name:
                    continue

            imputation_name = f'{cat_name} and {num_name}'

            # Perform the imputation process
            X_imputed, X_test_imputed = DataImputer.impute(
                X,
                cat_name,
                num_name,
                cat_imputer,
                num_imputer,
                num_features,
                cat_features,
                missing_numerical,
                missing_categorical,
                group_cols_cat,
                group_cols_num,
                apply_cardinality=apply_cardinality,
                X_test=X_test)

            # Missing value check
            missing_in_train = X_imputed.isnull().values.any()
            missing_in_test = X_test_imputed is not None and X_test_imputed.isnull().values.any()

            if missing_in_train or missing_in_test:
                # raise ValueError(f"There are still missing values in the DataFrame with {imputation_name} for {model_name}.")
                print(f"Warning: {imputation_name} imputation resulted in missing values for {model_name}.")

                # Skip the loop if the model is not CatBoost
                if 'CatBoost' not in model_name:
                    continue

            best_train_imputed[imputation_name] = X_imputed
            best_test_imputed[imputation_name] = X_test_imputed

            # If the model is not CatBoost and there are no missing values, apply encoding
            if 'CatBoost' not in model_name:
                if X_test is not None:
                    X_imputed, X_test_imputed = DataImputer.encode_data_cardinality(
                        X_imputed,
                        X_test= X_test_imputed,
                        drop_first=True,
                        apply_cardinality=apply_cardinality)
                else:
                    X_imputed = DataImputer.encode_data_cardinality(
                        X_imputed,
                        drop_first=True,
                        apply_cardinality=apply_cardinality)

            # Scale the data if needed
            if model_name in ['MLPRegressor', 'Ridge', 'Lasso']:
                X_imputed = DataImputer.scale_data(X_imputed)

            # Split the data
            if cv == 1:
                X_train, X_val, y_train, y_val = train_test_split(
                    X_imputed,
                    y,
                    test_size=.2,
                    random_state=seed)

                regressor.fit(X_train, y_train)

                y_pred = regressor.predict(X_val)
                rmse = DataImputer.rmse_score(y_val, y_pred)
                # print(f"RMSE: {rmse:.4f}\n")

                rmses.append(rmse)
                stds.append(0)
                # imputation_scores.append({'methods': imputation_name,'RMSE': rmse, 'Std': 0})
            else:
                scores = DataImputer.cross_val_score_func(regressor, X_imputed, y, cv=cv)
                # print(f"RMSE: {scores.mean():.4f}, Std: {scores.std():.4f}\n")
                rmses.append(scores.mean())
                stds.append(scores.std())
                # imputation_scores.append({'methods': imputation_name,'RMSE': scores.mean(), 'Std': scores.std()})

            x_labels.append(imputation_name)

        # Sort the results in ascending order
        sorted_indices = np.argsort(rmses)
        sorted_rmses = np.array(rmses)[sorted_indices]
        sorted_stds = np.array(stds)[sorted_indices]
        sorted_labels = np.array(x_labels)[sorted_indices]

        best_index = np.argmin(sorted_rmses)
        best_method = sorted_labels[best_index]
        best_rmse, best_std = sorted_rmses[best_index], sorted_stds[best_index]

        best_train = best_train_imputed[best_method]
        best_test = best_test_imputed[best_method]

        if cv == 1:
            print(f"The best method for {model_name}: {best_method}, RMSE: {best_rmse:.4f}")
        else:
            print(f"The best method for {model_name}: {best_method}, RMSE: {best_rmse:.4f}, Std: {best_std:.4f}")

        if visualize:
            DataImputer.visualize_imputation_scores(sorted_rmses, sorted_stds, sorted_labels, model_name)

        return best_train, best_test, best_method, best_rmse, best_std


    def impute_and_evaluate_models(self, X, y, models, missing_categorical, missing_numerical, group_cols_cat=None, group_cols_num=None, imputers=None, visualize_first=False, visualize=False, cv=1, apply_cardinality=True, X_test=None):
        """
        Impute missing values and evaluate different models.
        """
        print("---START---")
        best_data = []

        for i, model_name in enumerate(models, 1):
            t1 = datetime.now()

            # Visualization for the first model
            if visualize_first and i == 1:
                train_imputed, test_imputed, best_method, best_rmse, best_std = self.imputation_model_scores(
                    X,
                    y,
                    model_name,
                    missing_categorical,
                    missing_numerical,
                    group_cols_cat,
                    group_cols_num,
                    cv,
                    imputers=imputers,
                    visualize=True,
                    apply_cardinality=apply_cardinality,
                    X_test=X_test)
            else:
                train_imputed, test_imputed, best_method, best_rmse, best_std = self.imputation_model_scores(
                    X,
                    y,
                    model_name,
                    missing_categorical,
                    missing_numerical,
                    group_cols_cat,
                    group_cols_num,
                    cv,
                    imputers=imputers,
                    visualize=visualize,
                    apply_cardinality=apply_cardinality,
                    X_test=X_test)

            best_data.append({'model_name': model_name, 'train': train_imputed, 'test': test_imputed, 'best_method': best_method, 'best_rmse':best_rmse, 'best_std':best_std})

            t2 = datetime.now()
            time_diff = t2 - t1
            print(f"Time taken: {time_diff}\n")
        print("---END---")
        return best_data
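
The comparison loop above can be illustrated with a much smaller, self-contained sketch. It is not the `DataImputer` workflow itself: it uses synthetic data, a single `Ridge` model, and places each imputer inside a scikit-learn `Pipeline`, so the imputer is re-fit on the training part of every fold, whereas the class above imputes before cross-validation. All names and parameter values here are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic regression data with roughly 10% of entries removed at random.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X[rng.random(X.shape) < 0.10] = np.nan

imputers = {
    "Simple Mean": SimpleImputer(strategy="mean"),
    "Simple Median": SimpleImputer(strategy="median"),
    "KNN": KNNImputer(n_neighbors=5),
}

results = {}
for name, imputer in imputers.items():
    # Impute, scale, then fit the model; every step is re-fit within each fold.
    model = make_pipeline(imputer, StandardScaler(), Ridge())
    # The built-in scorer returns negative RMSE, so the sign is flipped.
    scores = -cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=3)
    results[name] = (scores.mean(), scores.std())

# Report the strategies from lowest to highest mean RMSE.
for name, (mean_rmse, std_rmse) in sorted(results.items(), key=lambda kv: kv[1][0]):
    print(f"{name}: RMSE {mean_rmse:.4f} (std {std_rmse:.4f})")
```

Keeping the imputer inside the pipeline avoids using validation-fold statistics during imputation, at the cost of re-fitting the imputer for every fold.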

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m242.2/242.2 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m8.3/8.3 MB[0m [31m19.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.7/98.7 MB[0m [31m8.0 MB/s[0m eta [36m0:00:00[0m
[?25h

# **`3.` Conclusion**

In this analysis, a variety of imputation techniques were employed to address missing data in both numerical and categorical features. The results underscore the importance of choosing an imputation strategy tailored to the dataset's characteristics, as the method used can significantly influence the quality of predictions.

Group-based imputation methods, such as group mean and group mode imputation, were found to be effective, enhancing data completeness while maintaining the relationships between features. This approach was most beneficial for tree-based models like the GradientBoostingRegressor, whose best result paired Group Mode with KNN. Standard imputation methods, such as KNN and Simple Mean, also performed well for specific models (for example, Simple Mean for LGBMRegressor and MLPRegressor).

To better understand the impact of KNN imputation, 9-fold cross-validation was performed for `n_neighbors` values of 3, 5, and 7, similar to the approach of Jäger and colleagues **(Jäger et al., 2021)**. This allowed for a detailed observation of how changes in the hyperparameter affect model performance. The results are given in the table below.

|Model|Imputation Methods|n\_neighbors|RMSE|
|---|---|---|---|
|CatBoostRegressor|Group Mode and KNN|3|83\.5513|
|CatBoostRegressor|Constant Categorical and KNN|5|83\.3924|
|CatBoostRegressor|Group Mode and KNN|7|83\.405|
|LGBMRegressor|Constant Categorical and KNN|3|83\.6108|
|LGBMRegressor|Simple Categorical and KNN|5|84\.0352|
|LGBMRegressor|Constant Categorical and KNN|7|83\.9112|
|GradientBoostingRegressor|Group Mode and KNN|3|82\.9256|
|GradientBoostingRegressor|Group Mode and KNN|5|82\.8678|
|GradientBoostingRegressor|Group Mode and KNN|7|82\.787|
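
A sweep of this kind could look like the following sketch. It assumes synthetic data, a shuffled 9-fold `KFold`, and a small `GradientBoostingRegressor`; none of these settings are taken from the experiment above, and the numbers it prints are unrelated to the table.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import KNNImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)

# Synthetic data with roughly 10% of entries removed at random (illustrative only).
X, y = make_regression(n_samples=180, n_features=5, noise=5.0, random_state=42)
X[rng.random(X.shape) < 0.10] = np.nan

# 9-fold cross-validation; shuffling is an assumption made for this sketch.
cv = KFold(n_splits=9, shuffle=True, random_state=42)

rmse_by_k = {}
for k in (3, 5, 7):
    model = make_pipeline(
        KNNImputer(n_neighbors=k),
        GradientBoostingRegressor(n_estimators=50, random_state=42),
    )
    # The built-in scorer returns negative RMSE, so the sign is flipped.
    scores = -cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=cv)
    rmse_by_k[k] = scores.mean()

for k, rmse in rmse_by_k.items():
    print(f"n_neighbors={k}: mean RMSE {rmse:.4f}")
```

Using the same `KFold` object for every `n_neighbors` value keeps the folds identical, so differences in RMSE come from the imputer setting and not from the split.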

The evaluation was conducted using both a train-test split and cross-validation, demonstrating that the choice of imputation method has a direct impact on model performance. KNN imputation, which is frequently noted in the literature for its effectiveness in predicting missing categorical data **(Memon, Wamala, & Kabano, 2023)**, was applied here primarily to numerical data. This process not only ensured that no missing values remained in the dataset but also allowed for a clear assessment of how different models respond to various imputation strategies. The best-performing imputation methods and the corresponding RMSE for each model, based on the train-test split, are presented in the table below.

|Model|Imputation Methods|RMSE|
|---|---|---|
|GradientBoostingRegressor|Group Mode and KNN|82\.8678|
|CatBoostRegressor|Constant Categorical and KNN|83\.3924|
|LGBMRegressor|Constant Categorical and Simple Mean|83\.8968|
|XGBRegressor|Constant Categorical and Simple Median|87\.4577|
|RandomForestRegressor|Constant Categorical and Simple Median|87\.6868|
|MLPRegressor|Simple Categorical and Simple Mean|88\.6755|
|ExtraTreesRegressor|Constant Categorical and KNN|89\.0288|
|AdaBoostRegressor|Constant Categorical and Simple Median|93\.4804|
|DecisionTreeRegressor|Group Mode and Simple Mean|119\.4029|

The key takeaway from this study is that combining standard and group-based imputation methods gave the lowest RMSE here (Group Mode with KNN for GradientBoostingRegressor), which makes this combination worth testing on datasets with a diverse range of missing data types. The RMSE values reported in this analysis provide a reference point for choosing suitable imputation techniques in future machine learning projects.


# **`4.` Contribution and Collaboration**

This notebook serves as a resource for analyzing imputation techniques for missing data handling. It can be applied to various datasets beyond the current scope. Feedback and contributions from those interested in enhancing this work are welcome. If you have suggestions, improvements, or additional techniques to share, please feel free to reach out!

# **`5.` References**

- Memon, S. M. Z., Wamala, R., & Kabano, I. H. (2023). A comparison of imputation methods for categorical data. *Informatics in Medicine Unlocked*, 42, 101382. https://doi.org/10.1016/j.imu.2023.101382

- Jäger, S., Allhorn, A., & Bießmann, F. (2021). A Benchmark for Data Imputation Methods. *Frontiers in Big Data*, 4. https://doi.org/10.3389/fdata.2021.693674
