# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 6)}}$

## $\color{purple}{\text{Missing Data in the Age of Machine Learning and Artificial Neural Networks}}$

### $\color{purple}{\text{Colab Environment Setup}}$

In [None]:
from google.colab import drive
import os
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/missingness_tutorial')

In [None]:
%pip install autoimpute

### $\color{purple}{\text{Libraries for this lesson}}$

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from autoimpute.imputations import MiceImputer
from autoimpute.imputations import SingleImputer
from matplotlib.patches import Rectangle
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from helpers import ImputationDisplayer, stat_comparison

### $\color{purple}{\text{Neural Network Imputers}}$

#### $\color{purple}{\text{Denoising Autoencoders}}$

* Missing data (or the deviation of an imputed value from the true value) is treated as noise.
* A denoising autoencoder is a neural network trained with the same data as both input and target output.
* The idea is that the network learns to output the input with the noise removed.
* To work properly, the data should be normalized before it is passed to the network and rescaled afterward.
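
As a rough illustration of the normalization point, the sketch below standardizes each column using only its observed values, so missing entries stay missing until they are imputed. The toy frame and its column names are assumptions made for this example, not the tutorial's data.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (hypothetical, not the tutorial data set)
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 50.0],
})

# pandas skips NaN by default; ddof=0 gives the population standard deviation
mean = toy.mean()
std = toy.std(ddof=0)
scaled = (toy - mean) / std

# Invert the scaling to return to the original units
restored = scaled * std + mean
```

The missing entries are still `NaN` in `scaled`, and `restored` matches `toy` wherever a value was observed.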

`scaler` uses `sklearn`'s `StandardScaler`


In [None]:
df = pd.read_csv('data/full_set.csv')
dmar_df = pd.read_csv('data/double_mar_set.csv')
ImputationDisplayer(dmar_df)
scaler = StandardScaler()
scaler.fit(dmar_df)
sdmar_df = pd.DataFrame(scaler.transform(dmar_df), columns=dmar_df.columns)

In [None]:
def restore_df(scaler, x):
    """
    Invert the scaling and return a DataFrame
    """
    return pd.DataFrame(scaler.inverse_transform(x), columns=dmar_df.columns)

The basic autoencoder was proposed by [Gondara and Wang](https://arxiv.org/abs/1705.02737):
![](https://raw.githubusercontent.com/WestHealth/scipy2022-missingness-tutorial/main/images/dae.svg)
* Deep neural network with 5 hidden layers, preceded by a dropout layer
* $\Theta$ is a hyperparameter governing how much each layer expands or contracts
* $\Theta=7$ is the commonly suggested value.
* Each of the first 3 hidden layers is $\Theta$ units wider than the layer before it, and each of the last 2 hidden layers is $\Theta$ units narrower.
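
To make the expand-and-contract pattern concrete, here is a small sketch that computes the hidden-layer widths from the number of features and $\Theta$. The helper name `dae_layer_widths` is made up for this example.

```python
def dae_layer_widths(n_features, theta=7):
    """Widths of the five hidden layers: three expanding, two contracting."""
    expanding = [n_features + k * theta for k in (1, 2, 3)]
    contracting = [n_features + k * theta for k in (2, 1)]
    return expanding + contracting


widths = dae_layer_widths(5, theta=7)
print(widths)  # [12, 19, 26, 19, 12]
```

With 5 features and $\Theta=7$, the widths match the `5 + k * theta` expressions used when the model is built; the output layer then returns to 5 units.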

#### Step 1 Impute the data set using univariate imputation
The recommendation is to use mean or median imputation for numeric data and mode imputation for categorical data.

In [None]:
univariate_imputed = SingleImputer('median').fit_transform(sdmar_df)
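
For readers without `autoimpute`, a comparable univariate fill can be sketched in plain `pandas`: median for numeric columns, mode for categorical ones. The toy frame is hypothetical, and this is not a description of how `SingleImputer` works internally.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type data with one gap per column
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "color": ["red", "red", None, "blue"],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: fill with the median of the observed values
        filled[col] = filled[col].fillna(filled[col].median())
    else:
        # Categorical column: fill with the most frequent observed value
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
```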

#### Step 2 Split data into training and test sets
This is only necessary if you are building a model that accepts future data (open configuration). If the data set is closed (i.e., you don't expect any new data), you can skip the split and train on the whole data set.

In [None]:
theta = 7
# Divide into training and test sets
training, test = train_test_split(univariate_imputed, test_size=0.3)

#### Step 3 Build, Compile and Train a Deep Neural Network Model
* theta and activation function are hyperparameters

See `tensorflow` and `keras` documentation for further detail

In [None]:
# Build the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5 + theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 2 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 3 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 2 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5)
])

In [None]:
# Compile the model
model.compile(optimizer='adam', loss='mse')

In [None]:
history = model.fit(training, training, epochs=50, verbose=False)

You can visualize how the loss progresses over the epochs:

In [None]:
plt.plot(history.history['loss'])

#### Step 4 Make Prediction based on initial imputation.
We replace the missing values with the predicted values and convert the result back to a `pandas` `DataFrame`.

In [None]:
predicted = pd.DataFrame(model.predict(univariate_imputed),
                         columns=dmar_df.columns)

In [None]:
# Don't forget to rescale the data after filling in missing data
imputed = restore_df(scaler, sdmar_df.combine_first(predicted))
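
The role of `combine_first` in this step can be seen on a tiny example: observed values are kept, and predictions only fill the holes. The two frames below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical observed data (with gaps) and model output (complete)
observed = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 5.0, 6.0]})
predicted = pd.DataFrame({"x": [1.1, 2.2, 2.9], "y": [4.4, 5.5, 6.6]})

# Observed values win; predictions are used only where a value is missing
merged = observed.combine_first(predicted)
print(merged)
```

Only the two missing cells take their values from `predicted`; every observed cell is unchanged.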

#### $\color{purple}{\text{Improved Feedback Denoising Autoencoders}}$

This is my own enhancement to the denoising autoencoder; see [here](https://arxiv.org/abs/2002.08338).

The algorithm was designed for closed data sets. This example shows one of its enhancements to the denoising autoencoder (DAE): iterative refinement of the imputed values. As with the DAE, **step 1** is univariate imputation.

In [None]:
univariate_imputation = SingleImputer('median').fit_transform(sdmar_df)

#### Step 2 Build and Compile Deep Neural Network Model
We use the same architecture as the DAE

In [None]:
# Build the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5 + theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 2 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 3 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 2 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5)
])

In [None]:
# Compile the model
model.compile(optimizer='adam', loss='mse')

#### Step 3 Initial Fit
We train for fewer epochs than the standard DAE (10 instead of 50).

In [None]:
history = [
    model.fit(univariate_imputation,
              univariate_imputation,
              epochs=10,
              verbose=False)
]

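#### Step 4 Iteration
Predict from the current imputation, put the observed values back in place, and train briefly on the result.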

In [None]:
predicted = pd.DataFrame(model.predict(univariate_imputation),
                         columns=dmar_df.columns)
iterated_imputation = sdmar_df.combine_first(predicted)
history.append(
    model.fit(iterated_imputation,
              iterated_imputation,
              epochs=2,
              verbose=False))

#### Repeat the Iteration a Prescribed Number of Times

In [None]:
for _ in range(0, 19):
    predicted = pd.DataFrame(model.predict(iterated_imputation),
                             columns=dmar_df.columns)
    iterated_imputation = sdmar_df.combine_first(predicted)
    history.append(
        model.fit(iterated_imputation,
                  iterated_imputation,
                  epochs=2,
                  verbose=False))
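
The feedback loop itself does not depend on the neural network. As a minimal sketch of the same predict-then-restore cycle, the example below swaps in a low-rank SVD reconstruction as a stand-in for `model.fit` and `model.predict`. The synthetic rank-1 data and the `reconstruct` helper are assumptions made for this example, not part of the original algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rank-1 data: every column is a multiple of one latent factor
latent = rng.normal(size=(50, 1))
full = latent @ np.array([[1.0, 2.0, -1.0]])
mask = rng.random(full.shape) < 0.2  # True where a value is missing
observed = np.where(mask, np.nan, full)


def reconstruct(x, rank=1):
    """Stand-in for the autoencoder: low-rank SVD reconstruction."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


# Step 1: univariate (median) starting point
current = np.where(mask, np.nanmedian(observed, axis=0), observed)
initial_error = np.abs(current - full)[mask].mean()

# Feedback loop: predict, then put the observed values back
for _ in range(20):
    predicted = reconstruct(current)
    current = np.where(mask, predicted, observed)

final_error = np.abs(current - full)[mask].mean()
```

Observed cells never change; only the imputed cells are refined on each pass.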

In [None]:
# Since we collected history in several batches, concatenate them so we can see a plot
losses = sum([each.history['loss'] for each in history], [])
plt.plot(losses)

#### Plug the final prediction into the missing values and rescale the result

In [None]:
predicted = pd.DataFrame(model.predict(iterated_imputation),
                         columns=dmar_df.columns)
imputed = restore_df(scaler, sdmar_df.combine_first(predicted))

In [None]:
dmar_df.displayer(imputed, 10)

## $\color{purple}{\text{How Imputation Fits Into Your Machine Learning Models}}$

* Typical ML Workflow
  * Train
  * Test
  * Use
  
![](https://raw.githubusercontent.com/WestHealth/scipy2022-missingness-tutorial/main/images/typical.svg)

* Save Your Pipeline
* You can include an imputer in your pipeline


![](https://raw.githubusercontent.com/WestHealth/scipy2022-missingness-tutorial/main/images/imputer.svg)

* We demonstrate this using `sklearn`'s pipeline, but the approach is meant to apply in the abstract to whatever framework you use.

The data set is taken from the [Wine Quality Dataset at UCI](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

This demonstrates a typical pipeline. The final column, `quality`, is the value to predict. The `features` variable contains all the other column names.

In [None]:
training = pd.read_csv('data/original_wine_training.csv')
test = pd.read_csv('data/original_wine_test.csv')

In [None]:
features = list(training.columns[0:-1])

We build the pipeline by one-hot encoding the categorical `type` column, then scaling all columns, then applying a random forest regressor.

In [None]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ["type"])],
                      remainder='passthrough'), StandardScaler(),
    RandomForestRegressor())

In [None]:
pipeline.fit(training[features], training['quality'])

In [None]:
pipeline.score(test[features], test['quality'])

In [None]:
pipeline.predict(training[features])

### $\color{purple}{\text{Imputer in the Data Pipeline}}$

This is meant to demonstrate the workflow; `autoimpute` is used only as an example.

One drawback is that `autoimpute` imputers require a `pandas` `DataFrame` as input, so custom transformers are needed to convert the array output of the earlier steps.


In [None]:
pandas_hack = FunctionTransformer(
    lambda x: pd.DataFrame(x, columns=['type_r', 'type_w'] + features[0:-1]))
pandas_hack_full = FunctionTransformer(
    lambda x: pd.DataFrame(x, columns=['type_r', 'type_w'] + features))

We can insert the imputer into the pipeline

In [None]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ['type'])],
                      remainder='passthrough'), pandas_hack,
    SingleImputer(strategy='least squares'), StandardScaler(),
    RandomForestRegressor())

In [None]:
wine_training = pd.read_csv('data/wine_training.csv')
pipeline.fit(wine_training[features], wine_training['quality'])

In [None]:
wine_test = pd.read_csv('data/wine_test.csv')
pipeline.score(wine_test[features], wine_test['quality'])

In [None]:
wine_future = pd.read_csv('data/wine_future.csv')
pipeline.predict(wine_future[features].iloc[0:10])

## $\color{purple}{\text{How does Multiple Imputation fit in?}}$
### $\color{purple}{\text{Approach 1: Augment Data with Multiple Copies}}$

![](https://raw.githubusercontent.com/WestHealth/scipy2022-missingness-tutorial/main/images/stack.svg)

Augmentation teaches the model that the imputed values are "fuzzy" by presenting different values for them.

We create the same pipeline, except that it ends with a `MiceImputer`.
The resulting `dfs` is a list of 5 copies of our data frame, each with a separate imputation.

In [None]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ['type'])],
                      remainder='passthrough'), StandardScaler(), pandas_hack,
    MiceImputer(k=5, strategy='stochastic'))
dfs = [each[1] for each in pipeline.fit_transform(wine_training[features])]
len(dfs)

We augment the training set by concatenating the 5 different data frames. Equivalently, we could rotate through the imputations, using a different one for each epoch.

In [None]:
augmented_training = pd.concat(dfs)
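
When the imputations are stacked, the labels have to be repeated in the same order so that each row keeps its target. A minimal sketch with three invented imputations of a two-row data set:

```python
import pandas as pd

# Three hypothetical imputations of the same two-row data set
imputations = [
    pd.DataFrame({"x": [1.0, 2.0], "y": [0.5, 0.7]}),
    pd.DataFrame({"x": [1.0, 2.1], "y": [0.6, 0.7]}),
    pd.DataFrame({"x": [1.0, 1.9], "y": [0.4, 0.7]}),
]
labels = pd.Series([0, 1], name="quality")

# Stack the imputations row-wise and repeat the labels in the same order
augmented_x = pd.concat(imputations, ignore_index=True)
augmented_y = pd.concat([labels] * len(imputations), ignore_index=True)
```

Row `i` of `augmented_x` and entry `i` of `augmented_y` always refer to the same original observation.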

Build our model as a classification problem. Bear in mind this is just for demonstration purposes; the model is not a particularly good estimator.

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(20, activation=tf.nn.tanh),
    tf.keras.layers.Dense(20, activation=tf.nn.tanh),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
quality = pd.concat([
    wine_training.quality, wine_training.quality, wine_training.quality,
    wine_training.quality, wine_training.quality
])

In [None]:
model.fit(augmented_training, quality, epochs=10, verbose=False)

Since the test set is only used to detect when the model starts to overfit, it is not necessary to multiply impute the test set, but you may.

In [None]:
test_dfs = [each[1] for each in pipeline.transform(wine_test[features])]
test1 = test_dfs[0]  # Variation one, just take one imputation
test2 = pd.concat(test_dfs)  # Variation two augment in the same way
quality1 = wine_test.quality
quality2 = pd.concat([quality1, quality1, quality1, quality1, quality1])

#### What about future values?
Same options:
 * take one imputation
 * run all imputations through the model and use an ensemble technique to combine the results (e.g., majority voting)

In [None]:
future_dfs = [
    each[1].iloc[[0]] for each in pipeline.transform(wine_future[features])
]
future1 = future_dfs[0]  # Variation one, just take one imputation
future2 = pd.concat(future_dfs)  # Variation two augment in the same way

In [None]:
# Variation 1: Pick First Imputed
np.argmax(model.predict(future1))

In [None]:
# Variation 2: Aggregate all Imputed
np.argmax(model.predict(future2).sum(axis=0))
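
Summing the class probabilities is sometimes called soft voting, while majority voting lets each imputation cast one vote for its top class. The sketch below contrasts the two on invented probabilities for 3 imputations and 4 classes.

```python
import numpy as np

# Hypothetical class probabilities: 3 imputations (rows) x 4 classes (columns)
probs = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.40, 0.45, 0.10],
    [0.10, 0.50, 0.30, 0.10],
])

# Soft voting: add up the probabilities, then pick the largest total
soft_vote = int(np.argmax(probs.sum(axis=0)))

# Majority (hard) voting: each imputation votes for its own top class
votes = np.argmax(probs, axis=1)
hard_vote = int(np.bincount(votes, minlength=probs.shape[1]).argmax())
```

Here both rules pick class 1, even though the second imputation on its own prefers class 2.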

### $\color{purple}{\text{Approach 2: Combine Multiple Models}}$

![](https://raw.githubusercontent.com/WestHealth/scipy2022-missingness-tutorial/main/images/ensemble.svg)

In [None]:
models = [
    tf.keras.models.Sequential([
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(20, activation=tf.nn.tanh),
        tf.keras.layers.Dense(20, activation=tf.nn.tanh),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ]) for _ in range(0, 5)
]
for model in models:
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

Each model is trained on a different imputed copy of the training data. Ideally, each model should be tested against each imputed copy of the test data; below, each model is evaluated on one imputed copy.

In [None]:
for model, training in zip(models, dfs):
    model.fit(training, wine_training.quality, epochs=10, verbose=False)

In [None]:
for model, test in zip(models, test_dfs):
    print(model.evaluate(test, wine_test.quality, verbose=False))
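
Testing every model against every imputed test set gives a grid of scores rather than a single list. A minimal sketch of that grid, using simple threshold functions as hypothetical stand-ins for the trained models:

```python
import numpy as np


def accuracy(predict, x, y):
    """Fraction of rows where the predicted class matches the label."""
    return float(np.mean(predict(x) == y))


# Hypothetical stand-ins for trained models: simple threshold classifiers
models = [lambda x, t=t: (x[:, 0] > t).astype(int) for t in (0.0, 0.5)]

# Two hypothetical imputations of the same test set, sharing one label vector
tests = [np.array([[-1.0], [0.2], [0.8]]), np.array([[-1.0], [0.6], [0.8]])]
labels = np.array([0, 1, 1])

# scores[i, j] is the accuracy of model i on imputation j
scores = np.array([[accuracy(m, x, labels) for x in tests] for m in models])
```

Reading along a row shows how sensitive one model is to the choice of imputation.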

In [None]:
# np.vstack stacks the list of (1, 10) prediction arrays into one (5, 10) array
predictions = np.vstack(
    [model.predict(future) for model, future in zip(models, future_dfs)])

In [None]:
# Aggregate predictions
np.argmax(predictions.sum(axis=0))

### $\color{purple}{\text{Worth Mentioning: Use Single Imputation with Bagging}}$
Bagging is short for bootstrap aggregating.

Rather than using multiple imputation, a single imputation is performed on each of several resampled (bootstrapped) data sets.
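
A minimal sketch of the resampling step in plain `pandas`, assuming a toy data set and using median imputation as a stand-in for the imputer. The key detail is that the labels are taken from the same resample as the features, so the rows stay aligned.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one feature with gaps plus a fully observed label
data = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0],
    "label": [0, 0, 1, 1, 0, 1],
})

bags = []
for seed in range(5):
    # Bootstrap: draw as many rows as the original, with replacement
    sample = data.sample(frac=1, replace=True, random_state=seed)
    # Single imputation fitted on this resample only (median as a stand-in)
    x_imputed = sample[["x"]].fillna(sample["x"].median())
    # Labels come from the same resample, so rows stay aligned
    bags.append((x_imputed, sample["label"]))
```

Each entry of `bags` can then be used to train one member of the ensemble.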

![](https://raw.githubusercontent.com/WestHealth/scipy2022-missingness-tutorial/main/images/single.svg)

In [None]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ['type'])],
                      remainder='passthrough'), StandardScaler(),
    pandas_hack_full, MiceImputer(k=1, strategy='stochastic'))
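# Keep each bootstrap resample so its labels stay aligned with its imputed rows
bootstraps = [wine_training.sample(frac=1, replace=True) for _ in range(0, 5)]
bagged_dfs = [next(pipeline.fit_transform(each))[1] for each in bootstraps]

In [None]:
models = [
    tf.keras.models.Sequential([
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(20, activation=tf.nn.tanh),
        tf.keras.layers.Dense(20, activation=tf.nn.tanh),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ]) for _ in range(0, 5)
]
for model in models:
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

In [None]:
# Train each model on its own bootstrap. Assuming `quality` is the final column
# of `wine_training`, the last column of each imputed frame is the scaled
# target, so drop it from the inputs and take labels from the matching resample.
for model, training, bootstrap in zip(models, bagged_dfs, bootstraps):
    model.fit(training.iloc[:, :-1],
              bootstrap.quality,
              epochs=10,
              verbose=False)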

In [None]:
predictions = np.vstack(
    [model.predict(future) for model, future in zip(models, future_dfs)])
# Aggregate predictions
np.argmax(predictions.sum(axis=0))

### $\color{purple}{\text{Worth Mentioning: Use missingness as a feature}}$

The idea is that you are giving information to the model as to which values are imputed.

In [None]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ['type'])],
                      remainder='passthrough'), StandardScaler(), pandas_hack,
    SingleImputer(strategy='stochastic'))
processed = pipeline.fit_transform(wine_test[features])

For each feature, we add an indicator column marking the rows where that feature was missing. Indicators for features with no missing values are all zero and can be dropped.

In [None]:
for each in features:
    processed[f'{each}_missing'] = wine_test[each].isnull().astype(int)
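
If you only want indicators for columns that actually contain missing values, the loop can be restricted accordingly. A small sketch on an invented frame, with median imputation as a stand-in for the pipeline:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data; `sugar` is fully observed
raw = pd.DataFrame({
    "acidity": [7.0, np.nan, 6.5],
    "sugar": [1.9, 2.6, 2.3],
    "ph": [np.nan, 3.2, np.nan],
})

# Stand-in imputation (column medians) so the example is self-contained
flagged = raw.fillna(raw.median())

# Add a 0/1 indicator for each column that had at least one missing value
for col in [c for c in raw.columns if raw[c].isna().any()]:
    flagged[f"{col}_missing"] = raw[col].isna().astype(int)
```

The result has indicator columns for `acidity` and `ph` but none for `sugar`.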

In [None]:
processed

## $\color{purple}{\text{Conclusion}}$


* Most work on predictive models that deal with missing data uses models derived from decision trees
* Neural networks can be used for imputation
* There are several strategies for integrating imputation into model-building pipelines
  * The imputer should be a processing step in the pipeline (it is important that the whole pipeline can be saved)
  * Multiple imputation can use data augmentation or multiple models
  * Bagging can be applied to single imputation performed on multiple resamples
  * Missingness (or imputed status) can be added as a flag

## $\color{purple}{\text{References}}$
* Gondara, L., Wang, K.: MIDA: Multiple imputation using denoising autoencoders. In: Phung, D., Tseng, V.S., Webb, G.I., Ho, B., Ganji, M., Rashidi, L. (eds.) _Advances in Knowledge Discovery and Data Mining_, pp. 260–272. Springer International Publishing, Cham (2018)
* Jiang, W., Josse, J., Lavielle, M.: Logistic regression with missing covariates–parameter estimation, model selection and prediction. _Computational Statistics & Data Analysis_, 2019.
* Lu, H.-m., Perrone, G., Unpingco, J.: Multiple imputation with denoising autoencoder using metamorphic truth and imputation feedback. _Machine Learning and Data Mining in Pattern Recognition, 16th International Conference on Machine Learning and Data Mining, MLDM 2020_, Amsterdam, The Netherlands, July 20-21, 2020, Proceedings, pp. 197–208.
* Perez-Lebel, A., Varoquaux, G., Le Morvan, M., Josse, J., Poline, J.-B.: Benchmarking missing-values approaches for predictive models on health databases, _GigaScience_, Volume 11, 2022, https://doi.org/10.1093/gigascience/giac013
* Khan, S., Ahmad, A., Mihailidis, A.: Bootstrapping and Multiple Imputation Ensemble Approaches for Missing Data, _Journal of Intelligent and Fuzzy Systems_, 2019. https://doi.org/10.48550/arXiv.1802.00154