# AutoFillGluon: Machine Learning-Based Missing Data Imputation

This vignette shows how to use the AutoFillGluon package to impute missing data. AutoFillGluon uses AutoGluon to train predictive models that fill in missing values in both numerical and categorical variables.

## Overview

AutoFillGluon offers the following key features:
- ML-based imputation using AutoGluon's predictive models
- Iterative refinement for improved quality
- Handles both numerical and categorical data
- Multiple imputation support
- Built-in evaluation of imputation quality
- Integration with survival analysis via custom scoring functions

## Setup

First, we'll install the required packages and import the necessary libraries.

In [1]:
# Install required packages if needed
#!pip install autogluon pandas numpy scikit-learn matplotlib seaborn lifelines

In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Import AutoFillGluon components
from autofillgluon import Imputer, multiple_imputation
from autofillgluon.utils import calculate_missingness_statistics, plot_imputation_evaluation
from autofillgluon import concordance_index_scorer, cox_ph_scorer, exponential_nll_scorer
from autogluon.tabular import TabularPredictor, TabularDataset

## Part 1: Basic Imputation with Synthetic Data

Let's start with a simple example using synthetic data to demonstrate the core functionality of the imputation process.

In [3]:
# Set random seed for reproducibility
np.random.seed(42)

# Create synthetic data with known relationships
def create_example_data(n_rows=100):
    """Create example data with some missing values."""
    # Create a dataframe with some correlation between columns
    df = pd.DataFrame({
        'age': np.random.normal(40, 10, n_rows),
        'income': np.random.normal(50000, 15000, n_rows),
        'experience': np.random.normal(15, 7, n_rows),
        'satisfaction': np.random.choice(['Low', 'Medium', 'High'], n_rows),
        'department': np.random.choice(['HR', 'Engineering', 'Sales', 'Marketing', 'Support'], n_rows)
    })
    
    # Add some correlations
    df['experience'] = df['age'] * 0.4 + np.random.normal(0, 3, n_rows)
    df['income'] = 20000 + df['experience'] * 2000 + np.random.normal(0, 5000, n_rows)
    
    # Add categorical biases
    df.loc[df['department'] == 'Engineering', 'income'] += 10000
    df.loc[df['department'] == 'Sales', 'income'] += 5000
    
    # Ensure proper data types
    df['satisfaction'] = df['satisfaction'].astype('category')
    df['department'] = df['department'].astype('category')
    
    # Create a complete copy before adding missing values
    df_complete = df.copy()
    
    # Add some missingness (each cell is dropped independently with probability 0.15)
    mask = np.random.random(df.shape) < 0.15
    df = df.mask(mask)
    
    return df, df_complete

# Generate example data
df_missing, df_complete = create_example_data(200)

# Display the first few rows of the dataset with missing values
print("First few rows of the dataset with missing values:")
df_missing.head()

First few rows of the dataset with missing values:


| | age | income | experience | satisfaction | department |
|---|---|---|---|---|---|
| 0 | 44.967142 | 63499.39642 | 20.746339 | High | Marketing |
| 1 | 38.617357 | NaN | NaN | Low | Sales |
| 2 | 46.476885 | 55242.97827 | 13.545293 | Medium | Engineering |
| 3 | 55.230299 | NaN | 16.816322 | High | Engineering |
| 4 | 37.658466 | 44733.108004 | 12.445495 | NaN | NaN |


### Analyze Missing Data Patterns

Before performing imputation, it's important to understand the pattern of missingness in the data. AutoFillGluon provides utility functions to analyze missing data.

In [4]:
# Calculate and display missingness statistics
missing_stats = calculate_missingness_statistics(df_missing)

# Create a summary dataframe
summary = pd.DataFrame([{
    'Column': col,
    'Missing Count': stats['count_missing'],
    'Missing Percent': f"{stats['percent_missing']:.1f}%",
    'Data Type': df_missing[col].dtype
} for col, stats in missing_stats.items()])

summary

| | Column | Missing Count | Missing Percent | Data Type |
|---|---|---|---|---|
| 0 | age | 33 | 16.5% | float64 |
| 1 | income | 30 | 15.0% | float64 |
| 2 | experience | 44 | 22.0% | float64 |
| 3 | satisfaction | 21 | 10.5% | category |
| 4 | department | 37 | 18.5% | category |
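
The same per-column counts can be cross-checked with plain pandas. The following is a minimal sketch: the helper name `missingness_summary` and the toy frame are illustrative assumptions, not part of AutoFillGluon.

```python
import numpy as np
import pandas as pd

def missingness_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing count, percentage, and dtype (hypothetical helper)."""
    return pd.DataFrame({
        "Missing Count": df.isnull().sum(),
        "Missing Percent": df.isnull().mean() * 100,
        "Data Type": df.dtypes.astype(str),
    })

# Toy data: one missing age, two missing departments
toy = pd.DataFrame({
    "age": [44.9, np.nan, 46.5, 55.2],
    "department": ["HR", "Sales", None, None],
})
summary_toy = missingness_summary(toy)
print(summary_toy)
```

Calling `missingness_summary(df_missing)` on the synthetic dataset should reproduce the counts in the table above.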


### Basic Imputation

Now we'll use the `Imputer` class to fill in the missing values in our dataset. The imputer will train a separate machine learning model for each column with missing values, using the other columns as features.

In [5]:
# Initialize imputer with conservative settings for this example
imputer = Imputer(
    num_iter=2,          # Number of iterations for imputation refinement
    time_limit=20,       # Time limit per column model (seconds)
    verbose=True         # Show progress information
)

# Fit imputer on data with missing values
df_imputed = imputer.fit(df_missing)

# Display the first few rows of the imputed dataset
print("First few rows of the imputed dataset:")
df_imputed.head()

2025-04-06 13:34:22,176 - autofillgluon.imputer.imputer - INFO - Fitting the imputer to the data...
2025-04-06 13:34:22,188 - autofillgluon.imputer.imputer - INFO - Iteration 1/2
2025-04-06 13:34:22,190 - autofillgluon.imputer.imputer - INFO - Processing column: department
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == category).
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
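
The idea of training one model per column and refining over several iterations can be pictured as a chained-imputation loop. The sketch below is a conceptual illustration only, not AutoFillGluon's implementation: it handles numeric columns only, uses ordinary least squares as a stand-in for AutoGluon, and the function name `iterative_impute` is hypothetical.

```python
import numpy as np

def iterative_impute(X, num_iter=2):
    """Chained imputation sketch for a 2-D float array containing NaNs.

    Each column with missing values is regressed on the remaining columns.
    A linear model stands in for AutoGluon here (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means so every model sees a complete feature matrix
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(num_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Fit only on rows where column j was actually observed
            coef, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            # Overwrite only the originally missing cells
            filled[rows, j] = A[rows] @ coef
    return filled

# Synthetic check: experience depends on age, then ~20% of it is hidden
rng = np.random.default_rng(0)
age = rng.normal(40, 10, 200)
experience = 0.4 * age + rng.normal(0, 1, 200)
X_true = np.column_stack([age, experience])
X_obs = X_true.copy()
X_obs[rng.random(200) < 0.2, 1] = np.nan

X_filled = iterative_impute(X_obs)
hidden = np.isnan(X_obs[:, 1])
mae_model = np.abs(X_filled[hidden, 1] - X_true[hidden, 1]).mean()
mae_mean = np.abs(np.nanmean(X_obs[:, 1]) - X_true[hidden, 1]).mean()
print(f"MAE model: {mae_model:.2f}, MAE column mean: {mae_mean:.2f}")
```

Because the hidden column is correlated with the observed one, the model-based fill should beat the column-mean baseline. This is the motivation for using predictive models rather than simple statistics.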



### Evaluating Imputation Quality

Since we have the original complete data, we can evaluate how well our imputation performed by comparing the imputed values with the true values.

In [None]:
# For numeric columns, we'll calculate correlation and error metrics
numeric_cols = ['age', 'income', 'experience']
numeric_results = []

for col in numeric_cols:
    # Find indices with missing values in the original data
    missing_mask = df_missing[col].isnull()
    if missing_mask.sum() > 0:
        # Get true and imputed values
        true_vals = df_complete.loc[missing_mask, col]
        imputed_vals = df_imputed.loc[missing_mask, col]
        
        # Calculate metrics
        corr = np.corrcoef(true_vals, imputed_vals)[0, 1]
        mae = np.abs(true_vals - imputed_vals).mean()
        mse = ((true_vals - imputed_vals) ** 2).mean()
        
        numeric_results.append({
            'Column': col,
            'Correlation': f"{corr:.4f}",
            'MAE': f"{mae:.4f}",
            'MSE': f"{mse:.4f}"
        })

# Display numeric results
pd.DataFrame(numeric_results)

In [None]:
# For categorical columns, we'll calculate accuracy
categorical_cols = ['satisfaction', 'department']
categorical_results = []

for col in categorical_cols:
    # Find indices with missing values in the original data
    missing_mask = df_missing[col].isnull()
    if missing_mask.sum() > 0:
        # Get true and imputed values
        true_vals = df_complete.loc[missing_mask, col]
        imputed_vals = df_imputed.loc[missing_mask, col]
        
        # Calculate accuracy (compare as strings so differing category dtypes don't raise)
        matches = true_vals.astype(str) == imputed_vals.astype(str)
        accuracy = matches.mean()
        
        categorical_results.append({
            'Column': col,
            'Accuracy': f"{accuracy:.4f}",
            'Correct': f"{matches.sum()}/{len(true_vals)}"
        })

# Display categorical results
pd.DataFrame(categorical_results)

In [None]:
# Visualize imputation results for a numeric column (age)
missing_mask = df_missing['age'].isnull()
if missing_mask.sum() > 0:
    plt.figure(figsize=(10, 6))
    plt.scatter(df_complete.loc[missing_mask, 'age'], 
                df_imputed.loc[missing_mask, 'age'], 
                alpha=0.7)
    plt.plot([df_complete['age'].min(), df_complete['age'].max()], 
            [df_complete['age'].min(), df_complete['age'].max()], 
            'r--')
    
    # Add regression line
    sns.regplot(x=df_complete.loc[missing_mask, 'age'], 
                y=df_imputed.loc[missing_mask, 'age'], 
                scatter=False, color='blue')
    
    # Calculate correlation coefficient
    corr = np.corrcoef(df_complete.loc[missing_mask, 'age'], df_imputed.loc[missing_mask, 'age'])[0, 1]
    plt.text(0.05, 0.95, f'Correlation: {corr:.4f}', 
            transform=plt.gca().transAxes, fontsize=12)
    
    plt.xlabel('True Age')
    plt.ylabel('Imputed Age')
    plt.title('True vs Imputed Values for Age')
    plt.grid(alpha=0.3)
    plt.tight_layout()

### Using the Built-in Evaluation Function

The `Imputer` class includes a built-in `evaluate` method that can assess imputation performance by artificially introducing missingness into complete data.

In [None]:
# Use the built-in evaluation function
eval_results = imputer.evaluate(
    data=df_complete,        # Complete data without missing values
    percentage=0.15,         # Fraction of values to set as missing (0.15 = 15%)
    n_samples=3              # Number of evaluation samples
)

# Format results for display
evaluation_summary = []
for col, metrics in eval_results.items():
    for metric_name, values in metrics.items():
        evaluation_summary.append({
            'Column': col,
            'Metric': metric_name,
            'Mean': f"{values['mean']:.4f}",
            'Std Dev': f"{values['std']:.4f}",
            'Min': f"{values['min']:.4f}",
            'Max': f"{values['max']:.4f}"
        })

pd.DataFrame(evaluation_summary)
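
The principle behind this kind of evaluation is to hide known values, impute them, and score only the hidden cells. The sketch below shows that procedure under stated assumptions: `evaluate_by_masking` and `column_mean_impute` are hypothetical names, the data are numeric only, and a column-mean imputer stands in for the trained models. It does not reproduce the internals of `Imputer.evaluate`.

```python
import numpy as np

def evaluate_by_masking(X_complete, impute_fn, fraction=0.15, n_samples=3, seed=0):
    """Hide a random fraction of cells, impute, and score on hidden cells only."""
    rng = np.random.default_rng(seed)
    X_complete = np.asarray(X_complete, dtype=float)
    maes = []
    for _ in range(n_samples):
        mask = rng.random(X_complete.shape) < fraction
        X_masked = np.where(mask, np.nan, X_complete)
        X_hat = impute_fn(X_masked)
        # Mean absolute error restricted to the artificially hidden cells
        maes.append(np.abs(X_hat[mask] - X_complete[mask]).mean())
    return {"mean": float(np.mean(maes)), "std": float(np.std(maes)),
            "min": float(np.min(maes)), "max": float(np.max(maes))}

def column_mean_impute(X):
    """Baseline imputer: replace NaNs with the observed column mean."""
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
mae_stats = evaluate_by_masking(X_demo, column_mean_impute)
print(mae_stats)
```

Repeating the masking several times (`n_samples`) gives a spread rather than a single number, which is why the summary above reports mean, standard deviation, minimum, and maximum per metric.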

### Saving and Loading the Imputer

After training an imputer, you can save it for later use on new data.

In [None]:
# Save the imputer
imputer.save('example_imputer')

# Load the imputer back
loaded_imputer = Imputer.load('example_imputer')

# Create new data with missing values
new_data = pd.DataFrame({
    'age': [35, np.nan, 42, 55, np.nan],
    'income': [np.nan, 45000, 60000, np.nan, 75000],
    'experience': [10, 15, np.nan, 25, 20],
    'satisfaction': ['Medium', np.nan, 'High', 'Low', 'Medium'],
    'department': [np.nan, 'Engineering', 'Sales', 'HR', np.nan]
})

# Apply loaded imputer to new data
new_data_imputed = loaded_imputer.transform(new_data)

# Display results
pd.concat([new_data.add_suffix('_original'), new_data_imputed.add_suffix('_imputed')], axis=1)

## Part 2: Multiple Imputation

Multiple imputation creates several imputed datasets to account for the uncertainty in imputation. This approach allows for more robust statistical inference when working with imputed data.

In [None]:
# Perform multiple imputation
imputed_datasets = multiple_imputation(
    data=df_missing,       # Data with missing values
    n_imputations=3,       # Number of imputed datasets to create
    fitonce=True,          # Fit one model and use it for all imputations
    num_iter=2,            # Number of iterations for each imputation
    time_limit=15,         # Time limit per column model (seconds)
    verbose=True           # Show progress information
)

# Compare the first row of imputed values across datasets
comparison = pd.DataFrame({
    f"Imputation {i+1}": dataset.iloc[0] 
    for i, dataset in enumerate(imputed_datasets)
})

comparison

### Analyzing Multiple Imputation Results

Here we'll see how imputed values can vary across multiple imputations and how to incorporate this uncertainty into analyses.

In [None]:
# Let's look at imputed values for a specific variable across datasets
col_to_analyze = 'income'

# Find rows with missing values for this column
missing_mask = df_missing[col_to_analyze].isnull()
missing_indices = df_missing.index[missing_mask]

if len(missing_indices) > 0:
    # Compare imputed values across datasets
    imputed_values = pd.DataFrame({
        f"Imputation {i+1}": dataset.loc[missing_indices, col_to_analyze]
        for i, dataset in enumerate(imputed_datasets)
    })
    
    # Add the true values if available
    imputed_values['True Value'] = df_complete.loc[missing_indices, col_to_analyze]
    
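    # Calculate mean and standard deviation across the imputation columns only
    # (selecting by name keeps 'True Value' and 'Mean' out of the statistics)
    imputation_cols = [c for c in imputed_values.columns if c.startswith('Imputation')]
    imputed_values['Mean'] = imputed_values[imputation_cols].mean(axis=1)
    imputed_values['Std Dev'] = imputed_values[imputation_cols].std(axis=1)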
    
    # Display the results (a bare expression inside an if-block is not rendered)
    print(imputed_values)

In [None]:
# Visualize variation in imputed values for a numeric column
if len(missing_indices) > 0:
    # Create a boxplot of imputed values across datasets
    plt.figure(figsize=(12, 6))
    
    # Get data for plotting
    plot_data = pd.melt(
        imputed_values.iloc[:, :-3], 
        var_name='Imputation', 
        value_name='Imputed Value'
    )
    
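    # Add true values as a separate column; melt stacks the imputation columns
    # one after another, so the true values must be tiled, not repeated element-wise
    plot_data['True Value'] = np.tile(imputed_values['True Value'].values, len(imputed_datasets))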
    
    # Create boxplot
    sns.boxplot(data=plot_data, x='Imputation', y='Imputed Value')
    
    # Overlay true values
    for i, imp in enumerate(plot_data['Imputation'].unique()):
        subset = plot_data[plot_data['Imputation'] == imp]
        plt.scatter(
            x=[i] * len(subset), 
            y=subset['True Value'], 
            color='red', 
            marker='x', 
            s=50, 
            label='True Value' if i == 0 else None
        )
    
    plt.xlabel('Imputation Dataset')
    plt.ylabel(f'Imputed Value for {col_to_analyze}')
    plt.title(f'Variation in Imputed Values Across Multiple Imputations\n(Red X = True Value)')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.tight_layout()
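
One common way to carry imputation uncertainty into an analysis is to estimate the quantity of interest on each imputed dataset and pool the results with Rubin's rules. The sketch below is an illustration of that standard technique. The function name `pool_rubin` is hypothetical, the numbers are made up for the example, and nothing here is claimed to be provided by AutoFillGluon.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors (Rubin's rules)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()
    within = variances.mean()            # average sampling variance
    between = estimates.var(ddof=1)      # variance of estimates across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Illustrative numbers only: mean income estimated on three imputed datasets
est = [52000.0, 52600.0, 51800.0]
var = [250000.0, 240000.0, 260000.0]
pooled_mean, total_var = pool_rubin(est, var)
print(f"Pooled estimate: {pooled_mean:.1f}, pooled SE: {total_var ** 0.5:.1f}")
```

With the datasets created above, the inputs could be built as `est = [d['income'].mean() for d in imputed_datasets]` and `var = [d['income'].var(ddof=1) / len(d) for d in imputed_datasets]`. The between-imputation term is what a single imputed dataset cannot supply.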

## Part 3: Survival Analysis with AutoFillGluon

AutoFillGluon includes custom scoring functions for survival analysis. In this section, we'll demonstrate how to use these scorers with the lifelines Rossi recidivism dataset.

In [None]:
# Load the Rossi recidivism dataset from lifelines
from lifelines.datasets import load_rossi
from lifelines import KaplanMeierFitter

rossi = load_rossi()

# Ensure time column is float
rossi['week'] = rossi['week'].astype(float)

# Display dataset info
print(f"Dataset shape: {rossi.shape}")
print("\nColumn descriptions:")
print("- week: Week of first arrest after release or end of study")
print("- arrest: Arrested during study period? (1=yes, 0=no)")
print("- fin: Financial aid received? (1=yes, 0=no)")
print("- age: Age at release (years)")
print("- race: Race (1=black, 0=other)")
print("- wexp: Work experience (1=yes, 0=no)")
print("- mar: Married? (1=yes, 0=no)")
print("- paro: Released on parole? (1=yes, 0=no)")
print("- prio: Number of prior convictions")

# Show the first few rows
rossi.head()

In [None]:
# Prepare the survival data for AutoGluon
def prepare_survival_data(df, time_col, event_col):
    """Prepare survival data by encoding time and event in a single column."""
    # Create a copy to avoid modifying the original
    df_model = df.copy()
    
    # Create the time column (positive for events, negative for censored)
    df_model['time'] = df_model[time_col]
    # Encode censored observations with negative times
    df_model.loc[df_model[event_col] == 0, 'time'] = -df_model.loc[df_model[event_col] == 0, time_col]
    
    # Drop the original time and event columns
    df_model = df_model.drop(columns=[time_col, event_col])
    
    return df_model

# Prepare the dataset for AutoGluon
df_model = prepare_survival_data(rossi, 'week', 'arrest')

# Create a version with artificial missing values
np.random.seed(42)
mask = np.random.random(df_model.shape) < 0.15

# Create a copy with missing values (don't introduce missingness in the target column)
mask[:, df_model.columns.get_loc('time')] = False
df_missing = df_model.mask(mask)

# Show the missing data
print("\nMissing values per column:")
print(df_missing.isnull().sum())
df_missing.head()
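
The signed-time encoding packs duration and event status into one number, so it helps to see that it can be reversed. The sketch below uses two hypothetical helpers, `encode_signed_time` and `decode_signed_time`. It assumes strictly positive durations, because a duration of zero cannot carry a sign.

```python
import numpy as np

def encode_signed_time(time, event):
    """Positive time for observed events, negative time for censored rows."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event).astype(bool)
    return np.where(event, time, -time)

def decode_signed_time(signed):
    """Recover (duration, event) from the signed encoding."""
    signed = np.asarray(signed, dtype=float)
    return np.abs(signed), (signed > 0).astype(int)

weeks = [20.0, 52.0, 17.0]
arrested = [1, 0, 1]
signed = encode_signed_time(weeks, arrested)
duration, event = decode_signed_time(signed)
print(signed, duration, event)
```

The round trip returns the original `week` and `arrest` values, which is what allows a scorer to work from a single label column.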

### Imputation for Survival Data

Let's first impute the missing values in our survival dataset.

In [None]:
# Impute missing values
imputer = Imputer(num_iter=2, time_limit=15, verbose=True)
df_imputed = imputer.fit(df_missing)

# Convert to TabularDataset for AutoGluon
df_model = TabularDataset(df_model)
df_imputed = TabularDataset(df_imputed)

# Display imputation summary
print(f"Original shape: {df_model.shape}")
print(f"Missing data shape: {df_missing.shape}")
print(f"Imputed data shape: {df_imputed.shape}")

### Survival Analysis with Custom Scoring Functions

Now let's demonstrate the survival analysis capabilities of AutoFillGluon using various scoring functions.

In [None]:
# Common constructor parameters for all models
common_params = {
    'label': 'time',       # The target variable
    'verbosity': 0         # Reduce verbosity
}

# Parameters for fit(); time_limit and presets are fit arguments, not constructor arguments
fit_params = {
    'time_limit': 60,      # Time limit in seconds
    'presets': 'medium_quality'
}

# Train with Cox PH scorer
print("Training with Cox PH scorer...")
cox_predictor = TabularPredictor(eval_metric=cox_ph_scorer, **common_params)
cox_predictor.fit(df_model, **fit_params)

# Train with concordance index scorer
print("Training with Concordance Index scorer...")
cindex_predictor = TabularPredictor(eval_metric=concordance_index_scorer, **common_params)
cindex_predictor.fit(df_model, **fit_params)

# Train with exponential NLL scorer
print("Training with Exponential NLL scorer...")
exp_predictor = TabularPredictor(eval_metric=exponential_nll_scorer, **common_params)
exp_predictor.fit(df_model, **fit_params)
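
To give a sense of what a Cox-style score measures, the sketch below computes the textbook negative partial log-likelihood from signed times and risk scores. The exact formula used by `cox_ph_scorer` is not shown in this vignette, so this is an assumption-laden illustration: the function name `cox_neg_partial_loglik` is hypothetical, and tied durations are not handled.

```python
import numpy as np

def cox_neg_partial_loglik(signed_time, risk_score):
    """Negative Cox partial log-likelihood (no tie handling).

    signed_time: positive for events, negative for censored observations.
    risk_score: higher values mean a higher predicted hazard.
    """
    signed_time = np.asarray(signed_time, dtype=float)
    risk_score = np.asarray(risk_score, dtype=float)
    duration = np.abs(signed_time)
    event = signed_time > 0
    # Sort by descending duration so the risk set of row i is rows 0..i
    order = np.argsort(-duration, kind="stable")
    risk_sorted = risk_score[order]
    event_sorted = event[order]
    # Running log-sum-exp of the risk scores over each risk set
    log_risk_set = np.logaddexp.accumulate(risk_sorted)
    # Only observed events contribute a term
    return -np.sum((risk_sorted - log_risk_set)[event_sorted])

signed_time = np.array([5.0, -8.0, 3.0, 10.0])
risk = np.array([0.5, 0.1, 1.2, -0.3])     # earlier events get higher risk
loss_good = cox_neg_partial_loglik(signed_time, risk)
loss_bad = cox_neg_partial_loglik(signed_time, -risk)
print(f"Well-ordered risks: {loss_good:.3f}, reversed risks: {loss_bad:.3f}")
```

Risk scores that rank early events highest give a lower loss than the reversed ranking, which is the behaviour a metric of this kind rewards.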

In [None]:
# Make predictions
cox_preds = cox_predictor.predict(df_model)
cindex_preds = cindex_predictor.predict(df_model)
exp_preds = exp_predictor.predict(df_model)

# Evaluate predictions using concordance index
from lifelines.utils import concordance_index

def evaluate_predictions(y_true, y_true_event, y_pred):
    """Evaluate predictions using concordance index."""
    # lifelines' concordance_index expects scores that increase with survival time,
    # so the predictions are treated as risk scores here and negated before scoring
    c_index = concordance_index(y_true, -y_pred, event_observed=y_true_event)
    return c_index

# Evaluate models
cox_cindex = evaluate_predictions(rossi['week'], rossi['arrest'], cox_preds)
cindex_cindex = evaluate_predictions(rossi['week'], rossi['arrest'], cindex_preds)
exp_cindex = evaluate_predictions(rossi['week'], rossi['arrest'], exp_preds)

# Display results
results = pd.DataFrame({
    'Model': ['Cox PH', 'Concordance Index', 'Exponential NLL'],
    'C-index': [cox_cindex, cindex_cindex, exp_cindex]
})

results
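
The concordance index itself is easy to state: among comparable pairs, it is the fraction in which the subject with the shorter observed time also has the lower predicted score. The pure-Python sketch below illustrates this under simplifying assumptions: `harrell_c` is a hypothetical name, tied event times are ignored, and the result is not guaranteed to match `lifelines.utils.concordance_index` on data with ties.

```python
def harrell_c(times, scores, events):
    """Concordance between predicted scores and observed times.

    A pair (i, j) is comparable when the subject with the shorter time had an
    observed event. Scores are read as predicted survival, so the pair is
    concordant when the earlier subject also has the lower score.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if scores[i] < scores[j]:
                    concordant += 1
                elif scores[i] == scores[j]:
                    concordant += 0.5   # tied predictions count as half
    return concordant / comparable

times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
scores = [4, 12, 30, 25]          # predicted survival, one pair out of order

c_survival = harrell_c(times, scores, events)
c_negated = harrell_c(times, [-s for s in scores], events)
print(c_survival, c_negated)
```

Negating the scores turns a concordance of `c` into `1 - c` when there are no tied predictions. This is why the sign convention of the predictions matters when calling `concordance_index`.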

In [None]:
# Compare models and visualize risk scores
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.scatter(rossi['week'], -cox_preds, c=rossi['arrest'], cmap='viridis', alpha=0.7)
plt.colorbar(label='Event (1=arrest)')
plt.xlabel('Time (weeks)')
plt.ylabel('Risk Score')
plt.title('Cox PH Risk Scores')

plt.subplot(1, 3, 2)
plt.scatter(rossi['week'], -cindex_preds, c=rossi['arrest'], cmap='viridis', alpha=0.7)
plt.colorbar(label='Event (1=arrest)')
plt.xlabel('Time (weeks)')
plt.ylabel('Risk Score')
plt.title('C-index Risk Scores')

plt.subplot(1, 3, 3)
plt.scatter(rossi['week'], -exp_preds, c=rossi['arrest'], cmap='viridis', alpha=0.7)
plt.colorbar(label='Event (1=arrest)')
plt.xlabel('Time (weeks)')
plt.ylabel('Risk Score')
plt.title('Exponential NLL Risk Scores')

plt.tight_layout()

### Impact of Imputation on Survival Models

Let's compare model performance between complete data and imputed data.

In [None]:
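# Train model on imputed data (time_limit and presets belong to fit(), not the constructor)
imputed_predictor = TabularPredictor(eval_metric=cox_ph_scorer, label='time', verbosity=0)
imputed_predictor.fit(df_imputed, time_limit=60, presets='medium_quality')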

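# Compare the leaderboards. When data is passed, the score columns are named
# 'score_test' and 'score_val'. Both models are scored on their own training data here,
# so 'score_test' is optimistic.
leaderboard_cols = ['model', 'score_test', 'score_val', 'pred_time_val']

print("Original data leaderboard:")
original_leaderboard = cox_predictor.leaderboard(df_model, silent=True)[leaderboard_cols]
print(original_leaderboard)

print("\nImputed data leaderboard:")
imputed_leaderboard = imputed_predictor.leaderboard(df_imputed, silent=True)[leaderboard_cols]
print(imputed_leaderboard)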

## Summary

In this vignette, we've demonstrated the key features of AutoFillGluon:

1. **Basic imputation** - Using the Imputer class to fill missing values with ML-based predictions
2. **Evaluation** - Assessing imputation quality through various metrics
3. **Multiple imputation** - Creating multiple imputed datasets to account for uncertainty
4. **Survival analysis** - Using custom scoring functions to train survival models with AutoGluon

AutoFillGluon provides a powerful toolkit for handling missing data in your machine learning projects, combining the ease of use of simple imputation methods with the predictive power of AutoGluon's machine learning models.