# Comparing Imputation Methods

1. Import the necessary libraries and load the Titanic dataset.
2. Split the dataset into training and test sets, then mask half of the `Age` values in the test set.
3. Impute the masked `Age` values in the test set using the global mean and the regularized mean.
4. Compare the accuracy of the two imputation methods.

## Step 1: Load, Split, and Mask the Titanic Dataset

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Load the Titanic dataset
titanic_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(titanic_url)

# Drop rows with a missing Age so that every masked value has a known ground truth
titanic_data = titanic_data.dropna(subset=['Age'])

# Split the data into training and test sets
train_data, test_data = train_test_split(titanic_data, test_size=0.2, random_state=42)
test_data = test_data.copy()  # explicit copy so the masking below does not write into a slice

# Mask a fraction of the Age column in the test set to simulate missing values
# (the sample is unseeded, so the masked rows and the MAE values differ from run to run)
mask_fraction = 0.5
num_samples = int(mask_fraction * len(test_data))
random_samples = test_data['Age'].sample(num_samples).index
test_data.loc[random_samples, 'Age'] = np.nan

## Step 2: Impute the `Age` Column in the Test Set

In [21]:
# Using Global Mean

# Compute the global mean from the training set
global_mean_age = train_data['Age'].mean()

# Impute the masked test set using the global mean
test_data_global_mean = test_data.copy()
test_data_global_mean['Age'] = test_data_global_mean['Age'].fillna(global_mean_age)

# Using Regularized Mean with the `impute_column` function

from regmean_imputer import impute_column

# Impute the 'Age' column in the train and test sets using the regularized mean, grouped by passenger class
imputed_train_data, imputed_test_data = impute_column(
    train_data=train_data,
    test_data=test_data,
    impute_col='Age',
    group_by_cols=['Pclass'],
)
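
The internals of `impute_column` are not shown in this notebook. As a rough illustration of what a regularized group mean commonly looks like, the sketch below shrinks each group's mean toward the global mean with a weight `m`. The helper name `shrunken_group_means`, the parameter `m`, and the formula itself are assumptions made for illustration; they are not taken from `regmean_imputer`.

```python
import pandas as pd


def shrunken_group_means(df, value_col, group_col, m=10.0):
    """Hypothetical helper: shrink each group's mean toward the global mean.

    estimate_g = (n_g * mean_g + m * global_mean) / (n_g + m)
    """
    global_mean = df[value_col].mean()
    # Per-group count and mean of the observed (non-missing) values
    stats = df.groupby(group_col)[value_col].agg(["count", "mean"])
    return (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)


# Toy data: class 1 has four observations, class 3 has only one.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3],
    "Age": [40.0, 42.0, 38.0, 40.0, 20.0],
})
estimates = shrunken_group_means(toy, "Age", "Pclass", m=2.0)
print(estimates)
```

In this toy example the global mean is 36. The class 3 estimate, backed by a single observation, is pulled strongly toward it, while the class 1 estimate stays close to its own mean of 40.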

## Step 3: Compare the Accuracy of Both Imputation Methods

In [20]:
# Compute the Mean Absolute Error (MAE) on the masked rows only, using the original ages as ground truth
original_ages = titanic_data.loc[random_samples, 'Age']
mae_global_mean = mean_absolute_error(original_ages, test_data_global_mean.loc[random_samples, 'Age'])
mae_regularized_mean = mean_absolute_error(original_ages, imputed_test_data.loc[random_samples, 'Age'])

print(f"MAE using Global Mean: {mae_global_mean}")
print(f"MAE using Regularized Mean: {mae_regularized_mean}")

MAE using Global Mean: 11.522796428307142
MAE using Regularized Mean: 11.084314402097597
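
For reference, the metric reported here is the mean absolute error over the $n$ masked entries, where $y_i$ is the original age and $\hat{y}_i$ is the imputed age, so both values are in years:

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$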


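In this run, regularized mean imputation (MAE ≈ 11.08 years) slightly outperforms global mean imputation (MAE ≈ 11.52 years), a reduction of about 0.44 years, or roughly 4%. Because the masked rows are drawn without a seed and only one train/test split is used, the size of the gap will vary between runs.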
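
One way to check how stable such a comparison is would be to repeat the masking over several seeds and look at the spread. The sketch below does this on synthetic data so that it runs without the dataset or `regmean_imputer`. The helper `compare_once`, the synthetic age distribution, and the use of a plain per-class mean as a stand-in for the regularized mean are all assumptions for illustration, and it masks rows of a single table rather than using a separate train/test split.

```python
import numpy as np
import pandas as pd


def compare_once(df, seed, mask_fraction=0.5):
    """Hypothetical helper: mask a fraction of Age, impute two ways, return both MAEs."""
    rng = np.random.default_rng(seed)
    masked_idx = rng.choice(
        df.index.to_numpy(), size=int(mask_fraction * len(df)), replace=False
    )
    observed = df.drop(index=masked_idx)
    truth = df.loc[masked_idx, "Age"]

    # Global mean and per-class means are estimated from the unmasked rows only
    global_pred = observed["Age"].mean()
    group_means = observed.groupby("Pclass")["Age"].mean()
    group_pred = df.loc[masked_idx, "Pclass"].map(group_means).fillna(global_pred)

    mae_global = (truth - global_pred).abs().mean()
    mae_group = (truth - group_pred).abs().mean()
    return mae_global, mae_group


# Synthetic stand-in for the Titanic data (class means and spread are assumed values)
rng = np.random.default_rng(0)
pclass = rng.integers(1, 4, size=600)
age = rng.normal(loc=45 - 6 * pclass, scale=12, size=600)
synthetic = pd.DataFrame({"Pclass": pclass, "Age": age})

results = pd.DataFrame(
    [compare_once(synthetic, seed) for seed in range(20)],
    columns=["mae_global", "mae_group"],
)
print(results.describe().loc[["mean", "std"]])
```

Reporting the mean and standard deviation across seeds gives a clearer picture than a single pair of MAE values.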