# Comparing Imputation Methods

1. Import necessary libraries and the Titanic dataset.
2. Split the dataset into train and test sets.
3. Mask the known `Age` values in the test set, then impute them using the global mean and a regularized mean.
4. Compare the accuracy of both imputation methods using mean absolute error (MAE).

## Step 1: Import the Titanic Dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from regmean_imputer import impute_column

# Load the Titanic dataset
titanic_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(titanic_url)

# Drop rows with a missing Age so every remaining row has a known ground-truth value
titanic_data = titanic_data.dropna(subset=['Age'])

## Step 2: Split, Mask, and Impute the `Age` Column in the Test Set

In [3]:
n_repeats = 20
mae_global_mean_list = []
mae_regularized_mean_list = []

for _ in range(n_repeats):
    # Split the data into training and test sets
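    train_data, test_data = train_test_split(titanic_data, test_size=0.2)
    # Work on an explicit copy so the masking below does not write into a slice of titanic_data
    test_data = test_data.copy()

    # Save the true ages, then mask every Age value in the test set
    original_ages = test_data['Age'].copy(deep=True)
    test_data['Age'] = np.nan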

    # Impute the 'Age' column in the train and test sets using the regularized mean
    imputed_train_data, imputed_test_data = impute_column(
        train_data=train_data,
        test_data=test_data,
        impute_col='Age',
        group_by_cols=['Pclass', 'Sex'],
    )

    # Impute the 'Age' column in the test set using the global mean
    global_mean_imputed_test_data = test_data.copy()
    global_mean_imputed_test_data['Age'] = global_mean_imputed_test_data['Age'].fillna(train_data['Age'].mean())

    # Compute the Mean Absolute Error (MAE) for both imputation methods
    mae_global_mean = mean_absolute_error(original_ages, global_mean_imputed_test_data['Age'])
    mae_regularized_mean = mean_absolute_error(original_ages, imputed_test_data['Age'])

    mae_global_mean_list.append(mae_global_mean)
    mae_regularized_mean_list.append(mae_regularized_mean)

Best regularization parameter for Age: 6
Best regularization parameter for Age: 6
Best regularization parameter for Age: 5
Best regularization parameter for Age: 9
Best regularization parameter for Age: 8
Best regularization parameter for Age: 5
Best regularization parameter for Age: 2
Best regularization parameter for Age: 2
Best regularization parameter for Age: 3
Best regularization parameter for Age: 1
Best regularization parameter for Age: 5
Best regularization parameter for Age: 1
Best regularization parameter for Age: 2
Best regularization parameter for Age: 3
Best regularization parameter for Age: 7
Best regularization parameter for Age: 5
Best regularization parameter for Age: 3
Best regularization parameter for Age: 7
Best regularization parameter for Age: 9
Best regularization parameter for Age: 7
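
The output above shows that `impute_column` selects a regularization parameter for each split, but the notebook does not show how the regularized mean itself is computed. One common form is a shrinkage estimate that pulls each group mean toward the global mean, more strongly for small groups. The sketch below illustrates that idea on toy data. The helper `shrunken_group_means`, the parameter `m`, and the toy values are assumptions made for illustration, not the confirmed formula or API of `regmean_imputer`.

```python
import pandas as pd

def shrunken_group_means(df, value_col, group_cols, m):
    """Group means shrunk toward the global mean with strength m (hypothetical helper)."""
    global_mean = df[value_col].mean()
    stats = df.groupby(group_cols)[value_col].agg(["sum", "count"])
    # (sum of group values + m * global mean) / (group size + m)
    return (stats["sum"] + m * global_mean) / (stats["count"] + m)

# Toy data (not from the Titanic dataset): a group of three and a group of one
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3],
    "Age": [40.0, 50.0, 60.0, 10.0],
})

means_m0 = shrunken_group_means(toy, "Age", ["Pclass"], m=0)  # plain group means
means_m4 = shrunken_group_means(toy, "Age", ["Pclass"], m=4)  # shrunk toward the global mean of 40
print(means_m0)
print(means_m4)
```

With `m=0` the result is the ordinary group mean. As `m` grows, every group moves toward the global mean, and the single-row group moves furthest, which is what makes the estimate more stable for sparse groups.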


## Step 3: Compare the Accuracy of Both Imputation Methods

In [4]:
# Compute the mean MAE for both imputation methods
mean_mae_global_mean = np.mean(mae_global_mean_list)
mean_mae_regularized_mean = np.mean(mae_regularized_mean_list)

print(f"Mean MAE using Global Mean over {n_repeats} repeats: {mean_mae_global_mean}")
print(f"Mean MAE using Regularized Mean over {n_repeats} repeats: {mean_mae_regularized_mean}")

Mean MAE using Global Mean over 20 repeats: 11.61663940087933
Mean MAE using Regularized Mean over 20 repeats: 10.6332905160187


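Averaged over 20 random splits, regularized mean imputation achieves a lower MAE than global mean imputation (about 10.63 vs. 11.62 years), a relative reduction of roughly 8.5%. Because the splits are not seeded, the exact figures will vary slightly between runs.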
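
The relative improvement can be recomputed directly from the two mean MAE values printed above. This is a small worked check; the helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(baseline, improved):
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

mae_global = 11.61663940087933       # mean MAE printed above for the global mean
mae_regularized = 10.6332905160187   # mean MAE printed above for the regularized mean

reduction = relative_reduction(mae_global, mae_regularized)
print(f"Relative MAE reduction: {reduction:.1%}")  # prints "Relative MAE reduction: 8.5%"
```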