# Comparing Imputation Methods

1. Import the necessary libraries and load the Titanic dataset.
2. Split the dataset into training and test sets, then mask part of the `Age` column in the test set.
3. Impute the masked `Age` values in the test set using the global mean and the regularized mean.
4. Compare the accuracy of the two imputation methods.

## Step 1: Import, split, and mask the Titanic Dataset

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Load the Titanic dataset
titanic_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(titanic_url)

# Drop rows with a missing Age so every value masked later has a known ground truth
titanic_data = titanic_data.dropna(subset=['Age'])

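# Split the data into training and test sets
train_data, test_data = train_test_split(titanic_data, test_size=0.2, random_state=42)

# Work on an explicit copy so the masking assignment below does not trigger
# pandas' SettingWithCopyWarning
test_data = test_data.copy()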

# Mask a fraction of the Age column in the test set to simulate missing values
mask_fraction = 0.5
num_samples = int(mask_fraction * len(test_data))
random_samples = test_data['Age'].sample(num_samples).index
test_data.loc[random_samples, 'Age'] = np.nan
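
The masking step above draws its rows without a fixed seed, so each run hides a different set of ages. As a minimal sketch of a reproducible variant, the helper below wraps the same idea in a function. The name `mask_column` and its `seed` parameter are hypothetical and not part of the original notebook, and the toy frame is illustrative only.

```python
import numpy as np
import pandas as pd

def mask_column(df, col, fraction, seed=None):
    """Return a copy of df with `fraction` of the rows in `col` set to NaN,
    plus the index of the masked rows (hypothetical helper, not from the notebook)."""
    masked = df.copy()
    n = int(fraction * len(masked))
    # Sample only from rows that currently hold a value, so every masked
    # entry has a known ground truth in the original frame
    idx = masked[col].dropna().sample(n, random_state=seed).index
    masked.loc[idx, col] = np.nan
    return masked, idx

# Toy data for illustration only
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0]})
masked_toy, masked_idx = mask_column(toy, "Age", fraction=0.5, seed=0)
```

Returning the masked index alongside the frame keeps the ground-truth lookup explicit, which is the role `random_samples` plays in the cell above.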

## Step 2: Impute the `Age` Column in the Test Set

### 2.1: Using Global Mean

In [6]:
# Compute the global mean from the training set
global_mean_age = train_data['Age'].mean()

# Impute the masked test set using the global mean
test_data_global_mean = test_data.copy()
test_data_global_mean['Age'] = test_data_global_mean['Age'].fillna(global_mean_age)

# Ensure that there are no missing values in the imputed test set
assert test_data_global_mean['Age'].isnull().sum() == 0

### 2.2: Using Regularized Mean with `impute_column` function

In [7]:
from regmean_imputer import impute_column

# Impute the masked test set using the regularized mean
test_data_regularized_mean = impute_column(data=test_data, impute_col='Age', group_by_cols=['Pclass', 'Sex'])

# Ensure that there are no missing values in the imputed test set
assert test_data_regularized_mean['Age'].isnull().sum() == 0

Best regularization parameter for Age: 8
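
The internals of `impute_column` are not shown in this notebook. One common way to define a regularized group mean is to shrink each group's mean toward the global mean, with a parameter controlling the strength of the shrinkage. The sketch below illustrates that idea under this assumption. The function name `shrunken_group_mean_impute` and the parameter `lam` are hypothetical, and the code is not claimed to match what `regmean_imputer` does.

```python
import numpy as np
import pandas as pd

def shrunken_group_mean_impute(df, impute_col, group_by_cols, lam):
    """Fill NaNs in impute_col with group means shrunk toward the global mean.

    Hypothetical helper: an assumed form of regularized-mean imputation,
    (group_sum + lam * global_mean) / (n + lam), with lam > 0.
    """
    out = df.copy()
    global_mean = out[impute_col].mean()       # NaNs are skipped
    grouped = out.groupby(group_by_cols)[impute_col]
    n = grouped.transform("count")             # observed values per group
    group_sum = grouped.transform("sum")       # NaNs are skipped
    shrunk = (group_sum + lam * global_mean) / (n + lam)
    out[impute_col] = out[impute_col].fillna(shrunk)
    return out

# Toy data for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan],
})
filled = shrunken_group_mean_impute(toy, "Age", ["Pclass"], lam=2)
```

Under this formulation, small groups are pulled strongly toward the global mean while large groups stay close to their own mean. A group with no observed values falls back to the global mean exactly.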


## Step 3: Compare the Accuracy of Both Imputation Methods

In [8]:
# Compute the Mean Absolute Error (MAE) for both imputation methods,
# evaluated only on the rows whose Age was masked
original_ages = titanic_data.loc[random_samples, 'Age']
mae_global_mean = mean_absolute_error(original_ages, test_data_global_mean.loc[random_samples, 'Age'])
mae_regularized_mean = mean_absolute_error(original_ages, test_data_regularized_mean.loc[random_samples, 'Age'])

print(f"MAE using Global Mean: {mae_global_mean}")
print(f"MAE using Regularized Mean: {mae_regularized_mean}")

MAE using Global Mean: 11.419547865124192
MAE using Regularized Mean: 10.239185871309381


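On this run, regularized mean imputation achieves a lower MAE on the masked ages than global mean imputation: about 10.24 versus 11.42 years, a reduction of roughly 10%. Because the masked rows are sampled without a fixed seed, the exact values will vary between runs.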
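
A single MAE gap on one split does not show how stable the difference is. As a rough sketch, a paired bootstrap over the masked rows can give an interval for the MAE difference between two sets of predictions. The function name `paired_bootstrap_mae_diff` and its parameters are hypothetical, and the data below is synthetic, not the Titanic results.

```python
import numpy as np

def paired_bootstrap_mae_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Bootstrap MAE(pred_a) - MAE(pred_b), resampling the same rows for both.

    Returns the mean difference and a 95% percentile interval (hypothetical helper).
    """
    y_true = np.asarray(y_true, dtype=float)
    err_a = np.abs(y_true - np.asarray(pred_a, dtype=float))
    err_b = np.abs(y_true - np.asarray(pred_b, dtype=float))
    rng = np.random.default_rng(seed)
    n = len(y_true)
    # Each bootstrap replicate draws n row positions with replacement
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = err_a[idx].mean(axis=1) - err_b[idx].mean(axis=1)
    low, high = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), low, high

# Synthetic illustration only
rng = np.random.default_rng(1)
y = rng.normal(30, 14, size=70)
pred_a = np.full(70, 30.0)               # constant prediction
pred_b = y + rng.normal(0, 5, size=70)   # prediction closer to the truth
mean_diff, low, high = paired_bootstrap_mae_diff(y, pred_a, pred_b)
```

In this notebook, the same function could be called with `original_ages` and the two imputed `Age` series restricted to `random_samples`. An interval that excludes zero would suggest the gap is not just an artifact of which rows were masked.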