# Comparing Imputation Methods

In this notebook, we'll compare the performance of two imputation methods on the Titanic dataset:
1. Simple Mean Imputation using `SimpleImputer` from `sklearn`.
2. Regularized Mean Imputation using our custom `RegularizedMeanImputer`.

## Step 1: Import, split, and mask the Titanic Dataset

In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer
from custom_imputers import RegularizedMeanImputer

# Load the Titanic dataset
titanic_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(titanic_url)

# Remove NaNs from the Age column
titanic_data = titanic_data.dropna(subset=['Age'])

# Split the data into training and test sets
train_data, test_data = train_test_split(titanic_data, test_size=0.2, random_state=42)

# Mask a fraction of the Age column in the test set to simulate missing values
mask_fraction = 0.5
num_samples = int(mask_fraction * len(test_data))
random_samples = test_data['Age'].sample(num_samples).index
test_data.loc[random_samples, 'Age'] = np.nan

## Step 2: Impute the `Age` Column in the Test Set

### 2.1: Using Global Mean with `SimpleImputer`

In [21]:
simple_imputer = SimpleImputer(strategy="mean")
test_data_simple_imputed = test_data.copy(deep=True)
test_data_simple_imputed['Age'] = simple_imputer.fit_transform(test_data[['Age']])

### 2.2: Using Regularized Mean with `RegularizedMeanImputer`

In [31]:
imputer = RegularizedMeanImputer(impute_col='Age', group_by_cols=['Pclass', 'Sex'], add_indicator=True)
test_data_reg_imputed = test_data.copy(deep=True)
test_data_reg_imputed = imputer.fit_transform(test_data)
test_data_reg_imputed.head()


KeyError: 'key of type tuple not found and not a MultiIndex'

## Step 3: Compare the Accuracy of Both Imputation Methods

By examining the MAE values, we can determine which imputation method is more accurate. This will give us insights into the performance of the two methods when dealing with real-world missing data scenarios.

In [18]:
# Compute the Mean Absolute Error (MAE) for both imputation methods
original_ages = titanic_data.loc[random_samples, 'Age']
mae_simple = mean_absolute_error(original_ages, test_data_simple_imputed['Age'].loc[random_samples])
mae_regularized = mean_absolute_error(original_ages, test_data_reg_imputed['Age'].loc[random_samples])

print(f"MAE using Simple Imputer: {mae_simple}")
print(f"MAE using RegularizedMeanImputer: {mae_regularized}")

MAE using Simple Imputer: 11.862676056338028
MAE using RegularizedMeanImputer: 11.09385874220467


Regularized mean imputation significantly outperforms global mean imputation.