# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 5)}}$

## $\color{purple}{\text{Advanced Imputation Techniques}}$

### $\color{purple}{\text{Colab Environmental Setup}}$

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/missingness_tutorial')

### $\color{purple}{\text{Libraries for this lesson}}$

In [None]:
import pandas as pd
import numpy as np
from helpers import stat_comparison, spotlight_donors, ImputationDisplayer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from autoimpute.imputations import SingleImputer
from autoimpute.imputations import MultipleImputer
from autoimpute.imputations import MiceImputer

In [None]:
df = pd.read_csv('data/full_set.csv')
mar_df = pd.read_csv('data/mar_set.csv')
ImputationDisplayer(mar_df)

## $\color{purple}{\text{Multivariate Imputation}}$
Conventional multivariate imputation falls into two categories:
* Regression Imputation
* Hot Deck Imputation

Another cutting-edge method worth mentioning:
* Neural Network Autoencoder

## $\color{purple}{\text{Regression Imputation}}$

General Technique:
Use regression models to impute numeric missing values and classification models to impute categorical ones:
* Linear Regression
* Stochastic Linear Regression
* Logistic Regression
* Other Possibilities (generally unexplored)
  * Random Forest
  * Decision Trees
  * KNN

### $\color{purple}{\text{Linear Regression}}$

* Works with MAR
* Can impute illegal (out of bounds) values
* Can underestimate variance/covariance
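
The out-of-bounds behaviour listed above is easy to reproduce on a toy example. The sketch below is not part of the tutorial's datasets: the quantity `y_obs` and the predictor `x_obs` are hypothetical, chosen so that every observed value is non-negative while the fitted line extrapolates below zero.

```python
import numpy as np

# Hypothetical non-negative quantity observed at x = 1..5 (here y = x**2).
x_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_obs = x_obs ** 2

# Least-squares straight line: y ~ slope * x + intercept.
slope, intercept = np.polyfit(x_obs, y_obs, 1)

# Impute y for a row whose predictor sits just outside the observed range.
x_missing = 0.0
y_imputed = slope * x_missing + intercept

print("smallest observed value:", y_obs.min())
print("imputed value:", y_imputed)
```

The imputed value is negative even though no observed value is, which is what "illegal" means in this context.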

In [None]:
linear_regressor = LinearRegression()

#### Perform the linear regression

We base the prediction of `feature a` on the remaining features in `rest`. We only run the regression on data with full rows, `full_data`.

In [None]:
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = mar_df.dropna()
linear_regressor.fit(full_data[rest], full_data['feature a'])
predicted = linear_regressor.predict(mar_df[rest])

#### A note about a code pattern

I will repeat the following code pattern, or a variation of it, throughout this lesson.

```.assign(**{'feature a': df['feature a'].where(~df['feature a'].isnull(), predicted)})```

Depending on the use case, I'll either fill in a value where one is missing or put a NaN back where a value was originally missing (see the section on MICE below).

In the form above, the pattern substitutes the predicted value only where the original value is missing.

It is essentially the same as

```df['feature a'] = df['feature a'].where(~df['feature a'].isnull(), predicted)```

but it returns a new dataframe, so the result can be passed along directly or used in method chaining.
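
To see the pattern in isolation, here is a minimal sketch on a toy frame. The frame `toy` and the array `toy_predicted` are made-up stand-ins for `mar_df` and the model output.

```python
import numpy as np
import pandas as pd

# Toy frame: 'feature a' has two missing entries (hypothetical values).
toy = pd.DataFrame({'feature a': [1.0, np.nan, 3.0, np.nan]})
# Stand-in for model output, one prediction per row.
toy_predicted = np.array([10.0, 20.0, 30.0, 40.0])

# Keep observed values; take the prediction only where the value is missing.
filled = toy.assign(
    **{
        'feature a':
        toy['feature a'].where(~toy['feature a'].isnull(), toy_predicted)
    })

print(filled['feature a'].tolist())     # observed values 1.0 and 3.0 survive
print(toy['feature a'].isnull().sum())  # the original frame is left untouched
```

Because `assign` returns a new frame, `toy` still contains its two NaNs afterwards.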

In [None]:
imputed = mar_df.assign(
    **{
        'feature a':
        mar_df['feature a'].where(~mar_df['feature a'].isnull(), predicted)
    })

### $\color{purple}{\text{Analyze the Results}}$

In [None]:
stat_comparison(df, imputed, 'feature a')

In [None]:
mar_df.displayer(imputed, 15)

### $\color{purple}{\text{Stochastic Regression}}$
* Extends linear regression by adding noise modeled on the residuals
* Better simulates variance
* Can also produce out of bounds values

We reuse the linear regression prediction from above and compute the statistics of its residuals.

In [None]:
residual = mar_df['feature a'] - predicted
residual.mean()
residual.std()

For the prediction we model the residual noise as a normal distribution and adjust predictions accordingly.

In [None]:
# Draw one noise value per prediction instead of hard-coding the row count
residual_noise = np.random.normal(residual.mean(), residual.std(),
                                  len(predicted))
predicted += residual_noise

In [None]:
imputed = mar_df.assign(
    **{
        'feature a':
        mar_df['feature a'].where(~mar_df['feature a'].isnull(), predicted)
    })
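
The effect of the added noise on variance can be checked on synthetic data. The sketch below is self-contained and does not use the tutorial's datasets; the data-generating process, the missingness rule, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): y depends linearly on x plus unit noise.
n = 5000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=1.0, size=n)

# y is missing more often when x is large, so missingness depends on x (MAR).
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-x))
observed = ~missing

# Fit the regression on complete rows only.
slope, intercept = np.polyfit(x[observed], y[observed], 1)
fitted = slope * x + intercept
residual_std = np.std(y[observed] - fitted[observed], ddof=2)

# Deterministic imputation: every imputed value sits exactly on the line.
deterministic = np.where(missing, fitted, y)

# Stochastic imputation: add residual-sized noise to the imputed values.
noise = rng.normal(0.0, residual_std, size=n)
stochastic = np.where(missing, fitted + noise, y)

print(f"variance of complete data:      {y.var():.3f}")
print(f"variance after linear impute:   {deterministic.var():.3f}")
print(f"variance after stochastic:      {stochastic.var():.3f}")
```

In this setup the deterministic imputation should report a smaller variance than the complete data, while the stochastic version should land closer to it.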

### $\color{purple}{\text{Analyze the Results}}$

### $\color{purple}{\text{Built into}}$ `autoimpute`

In [None]:
imputer = SingleImputer('least squares')
ls_imputations = imputer.fit_transform(mar_df)

In [None]:
from autoimpute.imputations import SingleImputer

imputer = SingleImputer('stochastic')
st_imputations = imputer.fit_transform(mar_df)

#### $\color{purple}{\text{Analyze Results}}$

### $\color{purple}{\text{Just For Fun}}$
Let's use a Random Forest Regression instead

In [None]:
rf_regressor = RandomForestRegressor()
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = mar_df.dropna()
rf_regressor.fit(full_data[rest], full_data['feature a'])
predicted = rf_regressor.predict(mar_df[rest])

In [None]:
imputed = mar_df.assign(
    **{
        'feature a':
        mar_df['feature a'].where(~mar_df['feature a'].isnull(), predicted)
    })

#### $\color{purple}{\text{Analyze Results}}$


## $\color{purple}{\text{Categorical Variables}}$

Imputation of categorical variables employs classification in place of regression. The most common choice is multinomial logistic regression.

In [None]:
cat_mar_df = pd.read_csv('data/categorical_mar.csv')
ImputationDisplayer(cat_mar_df)

In [None]:
from sklearn.linear_model import LogisticRegression

# Predictors used to classify the missing categorical feature
rest = ['feature a', 'feature b', 'feature c']

# Fit only on rows with no missing values
cleaned_df = cat_mar_df.dropna()
lr = LogisticRegression(random_state=0, max_iter=1000)
lr.fit(cleaned_df[rest], cleaned_df['cat feature'])

In [None]:
impute = lr.predict(cat_mar_df[rest])

In [None]:
imputed = cat_mar_df.assign(
    **{
        'cat feature':
        cat_mar_df['cat feature'].where(~cat_mar_df['cat feature'].isnull(),
                                        impute)
    })

In [None]:
cat_mar_df.displayer(imputed, 10)

## $\color{purple}{\text{Hot Deck Imputation}}$
* The general idea is to impute each missing value by randomly sampling from the remaining good (observed) values.
* Doesn't impute out-of-bounds values, because every imputed value is one that was actually observed

**How it works:**

* For each missing value, a set of donors is selected from good values
* A value is randomly selected from the set of donors
* Donors are selected by some metric-based algorithm
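
The three steps above can be sketched as a small function. This is an illustrative implementation, not the one used by any particular library: the function name `hot_deck_impute`, the choice of Euclidean distance, and the "k nearest rows" donor rule are all assumptions.

```python
import numpy as np


def hot_deck_impute(features, target, k=3, rng=None):
    """Fill NaNs in `target` by sampling from the k nearest complete rows."""
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    filled = target.copy()
    donor_rows = np.flatnonzero(~np.isnan(target))
    for row in np.flatnonzero(np.isnan(target)):
        # Distance from the incomplete row to every potential donor.
        distances = np.linalg.norm(features[donor_rows] - features[row], axis=1)
        # Donor pool: the k closest complete rows.
        pool = donor_rows[np.argsort(distances)[:k]]
        # The imputed value is an observed value drawn at random from the pool.
        filled[row] = target[rng.choice(pool)]
    return filled


# Hypothetical data: the last row is missing its target value.
features = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [0.15]])
target = np.array([1.0, 1.2, 0.9, 9.0, 9.5, np.nan])
filled = hot_deck_impute(features, target, k=3, rng=np.random.default_rng(0))
print(filled[-1])
```

Whatever value is drawn, it comes from the three rows nearest to the incomplete one, so it can never fall outside the observed range.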

The `demo_mar.csv` dataset contains the first 10 entries from one of my earlier runs. It has one missing value in `feature a`.

In [None]:
demo_df = pd.read_csv('data/demo_mar.csv')
demo_df

We use Euclidean distance to demonstrate how Hot Deck Imputation works, but in practice the metric is usually more statistically based and complex. For simplicity we add a `distance` feature.

In [None]:
def distance(x):
    return np.linalg.norm((x - demo_df.iloc[7]).dropna())


demo_df['distance'] = demo_df.apply(distance, axis=1)

#### Donor Selection

[Van Buuren](https://stefvanbuuren.name/fimd/) identifies four methods of selecting donors:

#### Method 1: (Single Donor)

Pick the sample closest to the missing value

In [None]:
donor = demo_df.dropna().nsmallest(1, 'distance')
spotlight_donors(demo_df, donor)

#### Method 2:

Donors selected from all points under a fixed threshold

In [None]:
threshold = 2
donors = demo_df.dropna()[demo_df.dropna().distance < threshold]['feature a']
spotlight_donors(demo_df, donors)

#### Method 3:

Closest N points selected as the set of donors

In [None]:
N = 3
donors = demo_df.nsmallest(N + 1, 'distance').tail(N)['feature a']
spotlight_donors(demo_df, donors)

#### Method 4:

All points are donors, but the donor is drawn at random with a probability that depends on distance, so closer points are more likely to be chosen

In [None]:
import random
# Pick with probability inversely proportional to distance
weights = 1 / demo_df.dropna()['distance']
random.choices(demo_df.dropna()['feature a'].to_list(),
               k=1,
               weights=weights.to_list())
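
To make the weighting in Method 4 concrete, the sketch below turns distances into selection probabilities and checks them against repeated draws. The donor values and distances are made up, and the small `epsilon` is an added assumption that guards against a donor at distance zero.

```python
import random

# Hypothetical donor values and their distances to the incomplete row.
donor_values = [4.2, 5.1, 3.9, 6.0]
donor_distances = [0.5, 1.0, 2.0, 4.0]

# Inverse-distance weights; epsilon avoids division by zero when a donor
# coincides exactly with the incomplete row.
epsilon = 1e-9
weights = [1.0 / (d + epsilon) for d in donor_distances]

# random.choices normalises the weights itself; we do it here only to
# show the implied selection probabilities.
total = sum(weights)
probabilities = [w / total for w in weights]

random.seed(0)
draws = random.choices(donor_values, weights=weights, k=10_000)
share_closest = draws.count(donor_values[0]) / len(draws)

print([round(p, 3) for p in probabilities])
print(round(share_closest, 3))
```

The share of draws that pick the closest donor should sit near its computed probability, and every donor keeps a nonzero chance of being selected.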

## $\color{purple}{\text{Predictive Mean Matching}}$
Uses linear regression as part of the metric.

Donors are selected from those observations whose predicted values from the linear regression most closely match the value predicted for the missing entry.
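
A simplified version of this idea can be sketched in plain NumPy. The function name `pmm_impute` and its parameters are hypothetical, and this sketch skips refinements that full implementations may add (such as drawing the regression coefficients at random), so treat it as an illustration rather than a reference implementation.

```python
import numpy as np


def pmm_impute(X, y, n_donors=3, rng=None):
    """Predictive mean matching sketch for one column `y` containing NaNs."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    # Least-squares fit on complete rows, with an intercept column.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ coef
    filled = y.copy()
    obs_idx = np.flatnonzero(observed)
    for row in np.flatnonzero(~observed):
        # Distance is measured between predictions, not raw features.
        gap = np.abs(predicted[obs_idx] - predicted[row])
        donors = obs_idx[np.argsort(gap)[:n_donors]]
        # Impute with the observed value of a randomly chosen donor.
        filled[row] = y[rng.choice(donors)]
    return filled


# Synthetic data (assumption): linear signal plus noise, first 20 rows missing.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)
y_missing = y.copy()
y_missing[:20] = np.nan
filled = pmm_impute(X, y_missing, rng=rng)
print(filled[:5])
```

Every imputed entry is copied from an observed row, which is why predictive mean matching cannot produce out-of-bounds values.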


In [None]:
from sklearn.linear_model import LinearRegression

linear_regressor = LinearRegression()

In [None]:
demo_df = pd.read_csv('data/demo_mar.csv')

In [None]:
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = demo_df.dropna()
linear_regressor.fit(full_data[rest], full_data['feature a'])
demo_df['regression'] = linear_regressor.predict(demo_df[rest])

In [None]:
demo_df['distance'] = np.abs(demo_df.regression - demo_df.regression.iloc[7])

In [None]:
N = 3
donors = demo_df.dropna().sort_values('distance').iloc[0:N]['feature a']
spotlight_donors(demo_df, donors, 7)

Predictive Mean Matching is the preferred imputation method, but it can be computationally expensive, so for this demo the dataset is truncated to 100 rows.

In [None]:
from autoimpute.imputations import SingleImputer

demo_df = mar_df[0:100].copy()
imputer = SingleImputer('pmm')
imputations = imputer.fit_transform(demo_df)

In [None]:
mar_df.displayer(imputations, 10)

## $\color{purple}{\text{Advanced Imputation Techniques: Multivariate Imputation by Chained Equations (MICE)}}$
* Often considered the gold standard of imputation
* Is actually more of an imputation blueprint
* Applicable with missingness in multiple columns

In [None]:
dmcar_df = pd.read_csv('data/double_mcar_set.csv')
missing_df = pd.DataFrame({
    'feature a': dmcar_df['feature a'].isnull(),
    'feature b': dmcar_df['feature b'].isnull()
})
ImputationDisplayer(dmcar_df)

#### First step: Impute each missing value with some form of univariate imputation (usually mean or median)

In [None]:
step1_df = dmcar_df.fillna({
    'feature a': dmcar_df['feature a'].mean(),
    'feature b': dmcar_df['feature b'].median()
})
dmcar_df.displayer(step1_df, 20)

#### Second Step: For each column impute using a regression or hot deck technique
Start with `feature a` then `feature b`

##### Clear the missing values for the imputer then impute feature a

In [None]:
imputer = SingleImputer('least squares')
step2a_df = imputer.fit_transform(
    step1_df.assign(**{
        'feature a':
        step1_df['feature a'].where(~missing_df['feature a'], np.nan)
    }))
dmcar_df.displayer(step2a_df, 10)

In [None]:
imputer = SingleImputer('least squares')
step2_df = imputer.fit_transform(
    step2a_df.assign(
        **{
            'feature b':
            step2a_df['feature b'].where(~missing_df['feature b'], np.nan)
        }))
dmcar_df.displayer(step2_df, 20)

#### Repeat Step 2 until results converge sufficiently

In [None]:
imputer = SingleImputer('least squares')
step3a_df = imputer.fit_transform(
    step2_df.assign(**{
        'feature a':
        step2_df['feature a'].where(~missing_df['feature a'], np.nan)
    }))
step3_df = imputer.fit_transform(
    step3a_df.assign(
        **{
            'feature b':
            step3a_df['feature b'].where(~missing_df['feature b'], np.nan)
        }))
dmcar_df.displayer(step3_df, 20)

In [None]:
stat_comparison(df, step3_df, 'feature a')

stat_comparison(df, step3_df, 'feature b')
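
The manual steps above can be folded into a loop with an explicit stopping rule. The sketch below uses plain NumPy least squares instead of `autoimpute`, on synthetic data; the function names, the tolerance `tol`, and the cycle cap `max_cycles` are assumptions chosen for illustration.

```python
import numpy as np


def regress_fill(data, target_col, missing_mask):
    """Re-impute one column from all the others with least squares."""
    others = np.delete(data, target_col, axis=1)
    design = np.column_stack([np.ones(len(data)), others])
    observed = ~missing_mask[:, target_col]
    coef, *_ = np.linalg.lstsq(design[observed], data[observed, target_col],
                               rcond=None)
    out = data.copy()
    out[~observed, target_col] = design[~observed] @ coef
    return out


def chained_equations(data, max_cycles=50, tol=1e-6):
    data = np.asarray(data, dtype=float)
    missing_mask = np.isnan(data)
    # Step 1: initialise every missing entry with its column mean.
    filled = np.where(missing_mask, np.nanmean(data, axis=0), data)
    for cycle in range(1, max_cycles + 1):
        previous = filled.copy()
        # Step 2: sweep over the columns that have missing values.
        for col in np.flatnonzero(missing_mask.any(axis=0)):
            filled = regress_fill(filled, col, missing_mask)
        # Stop once a full sweep no longer changes the imputed entries.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled, cycle


# Synthetic data (assumption): two noisy columns driven by a shared signal z.
rng = np.random.default_rng(0)
z = rng.normal(size=500)
truth = np.column_stack([z + rng.normal(scale=0.3, size=500),
                         2 * z + rng.normal(scale=0.3, size=500),
                         z])
data = truth.copy()
data[rng.random(500) < 0.2, 0] = np.nan  # MCAR holes in column 0
data[rng.random(500) < 0.2, 1] = np.nan  # MCAR holes in column 1

filled, cycles = chained_equations(data)
print("cycles run:", cycles)
```

Only the originally missing cells are rewritten on each sweep; observed values are never touched.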

### $\color{purple}{\text{MICE imputer in}}$ `autoimpute`

In [None]:
imputer = MiceImputer(n=1, k=5, strategy='least squares')

In [None]:
# The MICE imputer returns a multiple imputation (see next section).
# With n=1 there is a single pair, so we unpack it by referencing [0][1].
imputed = list(imputer.fit_transform(dmcar_df))[0][1]
dmcar_df.displayer(imputed, 20)

## $\color{purple}{\text{Advanced Imputation Techniques: Multiple Imputation}}$

Many imputation techniques are stochastic in nature: if you run the imputation a second time, you get slightly different imputed values for the missing entries.

**Multiple Imputation** is the method of imputing the missing values repeatedly. The result is a collection of possible imputed values.

With a collection of imputed values for each missing value, you can compute statistics and carry error margins and confidence intervals through your models.
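
One common way to combine results across imputations is Rubin's rules, which pool a per-imputation estimate and its variance into a single estimate and standard error. The sketch below is a minimal illustration: the function name `pool_rubin` is hypothetical and the input numbers are made up rather than taken from this tutorial's data.

```python
import numpy as np


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()                # combined point estimate
    within = variances.mean()                # average sampling variance
    between = estimates.var(ddof=1)          # spread across imputations
    total = within + (1 + 1 / m) * between   # total variance
    return pooled, np.sqrt(total)


# Hypothetical results from m = 5 imputed datasets (made-up numbers).
estimates = [10.0, 10.4, 9.8, 10.2, 10.1]
variances = [0.25, 0.25, 0.25, 0.25, 0.25]

pooled, pooled_se = pool_rubin(estimates, variances)
print(pooled, pooled_se)
```

The pooled standard error is larger than the per-imputation one because the between-imputation spread reflects the extra uncertainty caused by the missing data.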



We will use `autoimpute`'s multiple imputer to demonstrate. By default it produces 5 imputations and returns them as a generator, which we unpack using `list`.

In [None]:
imputer = MultipleImputer(strategy='least squares')
imputations = imputer.fit_transform(mar_df)
lists = list(imputations)  # Materialize the generator into a list

The result is a list of tuples. Each tuple is a pair with the imputation index (ordinal count) and the imputed dataframe.

In [None]:
# Display the third full imputation (list index 2)
mar_df.displayer(lists[2][1], 10)

$\color{red}{\Large{\text{ ⚠}}}$ The `least squares` option is deterministic. You will notice that all the imputations are the same.

In [None]:
[each[1].iloc[1]['feature a'] for each in lists]

If we use the `stochastic` strategy, each missing value will have multiple imputed values.

In [None]:
imputer = MultipleImputer(strategy='stochastic')
imputations = imputer.fit_transform(mar_df)
lists = list(imputations)  # Materialize the generator into a list

In [None]:
[each[1].iloc[1]['feature a'] for each in lists]

### $\color{purple}{\text{Conclusion}}$

* Univariate Imputation is fast and easy but works only on MCAR
* Multivariate Imputation comes in two broad flavors
  * Regression/Classification
  * Hot Deck Imputation
* Multivariate Imputation by Chained Equations (MICE) deals well with multiple missing features
* Multiple Imputation can be used to carry statistics into your model


### $\color{purple}{\text{References}}$
* van Buuren, S., Groothuis-Oudshoorn, K.: mice: Multivariate imputation by chained equations in R. _Journal of Statistical Software_, Articles 45(3), 1–67 (2011). https://doi.org/10.18637/jss.v045.i03, https://www.jstatsoft.org/v045/i03