## AutoGluonImputer

This package imputes missing values in tabular datasets using the AutoGluon `TabularPredictor`. It works with both numerical and categorical columns and takes a machine-learning-driven approach to imputation rather than relying on simple statistics such as the mean or mode.

### Import libraries

We start by loading libraries.

In [None]:
# Uncomment to install or upgrade the dependencies
#!pip install --upgrade pandas numpy scikit-learn autogluon
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml
from autogluon.tabular import TabularDataset
from scripts.autogluonImputer import Imputer


: 

### Understanding the Imputer Class

Before we utilize the `Imputer` class for handling missing data, it's beneficial to understand its structure and functionalities. In the next cell, we'll retrieve and display the help documentation and source code for this class.


In [None]:
# get help about Imputer
help(Imputer)

# print the content of Imputer
import inspect
print(inspect.getsource(Imputer))

: 

### Prepare the Data


#### Data Preparation Overview

In this step, we load the Titanic dataset using `fetch_openml` and perform initial data preprocessing, including:
- Merging the features and target variable into a single DataFrame.
- Dropping less relevant columns like 'name' and 'ticket'.
- Displaying the first few rows of the DataFrame for a quick overview.


In [None]:

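# Load the Titanic data as pandas objects
X, y = fetch_openml(
    "titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)

# Combine features and target in one DataFrame
df = X.copy()
df['target'] = y

# Drop the less relevant 'name' and 'ticket' columns
df = df.drop(columns=['name', 'ticket'])

# Display the first rows (only the last expression in a cell is shown)
df.head()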


: 

#### Data Type Conversion

To prepare the dataset for use with the AutoGluon framework, we convert:
- String columns (object type) to categorical data types.
- Integer columns to float data types.

These conversions make the column types explicit for AutoGluon. Float columns can also hold `NaN`, which NumPy integer columns cannot, so the dtypes stay stable once missing values are introduced.


In [None]:
df = TabularDataset(df)

# Convert object (string) columns to category
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category')

# Convert integer columns to float
for col in df.select_dtypes(include='int64').columns:
    df[col] = df[col].astype('float64')

# Inspect the resulting dtypes
df.dtypes


: 
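To illustrate why the integer-to-float conversion matters before masking, here is a minimal sketch on a toy frame. The column names `count` and `label` are made up for illustration and are not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are illustrative only.
toy = pd.DataFrame({"count": [1, 2, 3], "label": ["a", "b", "a"]})

# Hiding a value in an int64 column forces pandas to upcast it to float64,
# because a NumPy integer column cannot store NaN.
hide_first = np.array([[True, False], [False, False], [False, False]])
masked = toy.mask(hide_first)
print(masked.dtypes)

# Converting up front keeps the dtypes identical before and after masking.
toy["count"] = toy["count"].astype("float64")
toy["label"] = toy["label"].astype("category")
hide_two = np.array([[True, False], [False, True], [False, False]])
masked_after = toy.mask(hide_two)
print(masked_after.dtypes)
```

With the conversion done first, the training data and the masked data share the same column types, which avoids surprises when comparing them later.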

#### Introducing Missing Values

To simulate a realistic scenario where datasets often have missing values, we artificially introduce missingness into our training and test datasets. This step allows us to demonstrate the effectiveness of the `Imputer` class in dealing with incomplete data.


In [None]:

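# Split the data into train and test sets
train, test = train_test_split(df, test_size=0.3, random_state=42)

# Introduce missingness: hide each cell independently with probability 0.2.
# A seeded generator keeps the masks reproducible across runs.
rng = np.random.default_rng(42)
train_missing = train.mask(rng.random(train.shape) < 0.2)
test_missing = test.mask(rng.random(test.shape) < 0.2)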


: 
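The masking step above hides each cell independently with probability 0.2, which corresponds to data that is missing completely at random. A quick sanity check is to measure the resulting share of missing cells. The sketch below does this on a toy frame of arbitrary random values (a stand-in for `train`, not the Titanic data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the mask is reproducible

# Toy stand-in for the training split; the values are arbitrary.
toy = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("abcd"))

# Hide each cell independently with probability 0.2.
toy_missing = toy.mask(rng.random(toy.shape) < 0.2)

# Fraction of missing cells per column and overall.
per_column = toy_missing.isna().mean()
overall = toy_missing.isna().to_numpy().mean()
print(per_column.round(3))
print(f"overall missing fraction: {overall:.3f}")
```

The same two lines (`isna().mean()`) can be run on `train_missing` and `test_missing`; note that on the real data the fractions will be somewhat higher than 0.2 for columns that already contained missing values.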

## Imputing Missing Values with AutoGluonImputer

We use the `Imputer` class to fill in the missing values in our datasets. The `Imputer` is configured and then applied to both the training and test datasets to perform imputation. The settings for the number of iterations (`num_iter`) and time limit (`time_limit`) are adjustable parameters that control the imputation process.


In [None]:
imputer = Imputer(num_iter=2, time_limit=5)
train_imputed = imputer.fit(train_missing)
test_imputed = imputer.transform(test_missing)


: 

### Evaluating Imputation Quality

To assess the quality of the imputed values, we focus on the 'age' feature in the test dataset:
- We plot the imputed values against the original values.
- A scatter plot with a regression line helps visualize the accuracy of the imputation.
- We calculate the correlation coefficient between the imputed and original values to quantify the imputation accuracy.


In [None]:
# Compare imputed values with original values for the 'age' feature
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Rows of the test set whose 'age' is missing after masking
missing_indices_test = test_missing.index[test_missing['age'].isna()]

# Collect imputed and original values; drop rows whose original age was
# already missing, since they have no ground truth to compare against
comparison = pd.DataFrame({
    'imputed': test_imputed.loc[missing_indices_test, 'age'],
    'original': test.loc[missing_indices_test, 'age'],
}).dropna()

# Scatter plot of imputed against original values with a regression line
sns.regplot(x='imputed', y='original', data=comparison, line_kws={'color': 'red'})
plt.xlabel('Imputed Values')
plt.ylabel('Original Values')
plt.title('Imputed Values vs Original Values')

# Calculate and display the correlation coefficient
corr = np.corrcoef(comparison['imputed'], comparison['original'])[0, 1]
plt.text(.6, .75, f'Correlation Coefficient = {round(corr, 2)}',
         horizontalalignment='center', verticalalignment='center',
         transform=plt.gca().transAxes, color='black')
plt.show()

: 
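A correlation coefficient measures how well the imputed values track the originals, but it does not reveal a systematic offset. As a complement, the sketch below also reports the mean absolute error and root mean squared error. `imputation_errors` is a hypothetical helper written for this illustration and is not part of the `Imputer` class; the toy values are made up.

```python
import numpy as np
import pandas as pd

def imputation_errors(imputed: pd.Series, original: pd.Series) -> dict:
    """Return correlation, MAE, and RMSE over rows where both values exist."""
    pair = pd.DataFrame({"imputed": imputed, "original": original}).dropna()
    diff = pair["imputed"] - pair["original"]
    return {
        "corr": float(pair["imputed"].corr(pair["original"])),
        "mae": float(diff.abs().mean()),
        "rmse": float(np.sqrt((diff ** 2).mean())),
    }

# Toy values: the imputed series is the original shifted by a constant,
# so the correlation is perfect while the errors are clearly non-zero.
original = pd.Series([20.0, 30.0, 40.0, np.nan, 60.0])
imputed = original + 5.0
scores = imputation_errors(imputed, original)
print(scores)
```

On the notebook's data, the same helper could be called with the imputed and original `age` values at the masked rows.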

### Preview of Imputed Test Data

To get a sense of the results of the imputation, we preview the first few rows of the imputed test dataset. This helps in visually assessing the changes and imputations made by the `Imputer` class.


In [None]:
test_imputed.head()

: 

### Evaluating the Imputation Method

The `Imputer` class provides an `evaluate_imputation` method to assess the effectiveness of the imputation. This function simulates missingness in the data, imputes the values, and then compares the imputed values against the original data. It provides a quantitative measure of the imputation's accuracy.


In [None]:
imputer.evaluate_imputation(train, percentage=.2, ntimes=3)


: 
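The mask, impute, compare procedure described above can be sketched without AutoGluon. In the example below, `evaluate_mean_imputation` is a hypothetical helper, and column-mean filling is a deliberately simple stand-in for the model-based `Imputer`. It only illustrates the idea and makes no claim about the internals or the output format of `evaluate_imputation`.

```python
import numpy as np
import pandas as pd

def evaluate_mean_imputation(data, percentage=0.2, ntimes=3, seed=0):
    """Mask observed cells, fill them with column means, and score the fill.

    Column-mean filling stands in for a model-based imputer.
    Returns the mean absolute error per column, averaged over repetitions.
    """
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(ntimes):
        hide = rng.random(data.shape) < percentage   # cells to hide this run
        masked = data.mask(hide)
        filled = masked.fillna(masked.mean())         # stand-in imputation
        # Score only cells hidden in this run that were known originally.
        scored = hide & data.notna().to_numpy()
        abs_err = (filled - data).abs().where(scored)
        runs.append(abs_err.mean())
    return pd.concat(runs, axis=1).mean(axis=1)

# Toy numeric data with two columns on very different scales.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "x": rng.normal(50.0, 10.0, size=500),
    "y": rng.normal(0.0, 1.0, size=500),
})
result = evaluate_mean_imputation(toy)
print(result)
```

Repeating the masking several times, as the `ntimes` argument suggests, reduces the dependence of the score on one particular random mask.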

## Multiple Imputation

Multiple imputation is a statistical technique used to handle missing data by creating multiple complete datasets. Each dataset is imputed separately, and the results are typically pooled for analysis. This approach accounts for the uncertainty associated with imputation, often leading to more robust and reliable statistical inferences.


In [None]:
from scripts.autogluonImputer import multiple_imputation
num_iter=2
time_limit=10
train_imputed = multiple_imputation(train_missing, n_imputations=10, num_iter=num_iter, time_limit=time_limit, fitonce=True)


: 

In [None]:
train_imputed[0].head()

: 

In [None]:
train_imputed[1].head()

: 

In [None]:
train_imputed[1].dtypes

: 
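Pooling results across imputed datasets is commonly done with Rubin's rules: the pooled estimate is the average of the per-dataset estimates, and its variance combines the average within-dataset variance with the between-dataset variance. The sketch below pools a column mean. It assumes the imputed datasets are available as a list of DataFrames, as the indexing above suggests; `pool_mean` is a hypothetical helper and the toy frames are made up.

```python
import numpy as np
import pandas as pd

def pool_mean(imputed_frames, column):
    """Pool the mean of `column` across imputed datasets with Rubin's rules."""
    m = len(imputed_frames)
    # Point estimate from each imputed dataset.
    estimates = np.array([f[column].mean() for f in imputed_frames])
    # Squared standard error of the mean within each dataset.
    within = np.array([f[column].var(ddof=1) / len(f) for f in imputed_frames])
    pooled = estimates.mean()
    between = estimates.var(ddof=1)
    # Total variance: within-imputation plus inflated between-imputation part.
    total_var = within.mean() + (1 + 1 / m) * between
    return pooled, np.sqrt(total_var)

# Toy stand-ins for three imputed datasets; only the last value differs,
# as if it had been missing and imputed differently each time.
frames = [
    pd.DataFrame({"age": [20.0, 30.0, 40.0, 28.0]}),
    pd.DataFrame({"age": [20.0, 30.0, 40.0, 32.0]}),
    pd.DataFrame({"age": [20.0, 30.0, 40.0, 30.0]}),
]
estimate, std_error = pool_mean(frames, "age")
print(f"pooled mean = {estimate:.2f}, standard error = {std_error:.2f}")
```

The between-imputation term is what carries the imputation uncertainty: if every imputed dataset gave the same estimate, it would vanish and the pooled standard error would reduce to the ordinary one.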