## $\color{purple}{\text{Missing Data in the Age of Machine Learning and Artificial Neural Network (part 2)}}$

## $\color{purple}{\text{Imputation in Machine Learning and Multiple Imputation}}$

## $\color{purple}{\text{How Imputation Fits Into Your Machine Learning Models}}$

### Typical Machine Learning Workflow: 
  * Encoding
  * Preprocessing
  * Train
  * Test
  * Use
  
![](https://raw.githubusercontent.com/WestHealth/pydataglobal-2022/main/images/typical.svg)


### Machine Learning Workflow with Imputation: 

  * Encoding
  * Preprocessing
  * *Imputer*
  * Train
  * Test
  * Use


![](https://raw.githubusercontent.com/WestHealth/pydataglobal-2022/main/images/imputer.svg)


* We demonstrate this using `sklearn`'s pipeline, but the steps are meant to describe abstractly what you should do, whatever tooling you use.

The sample data set is taken from the [Wine Quality Dataset at UCI](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

The example below demonstrates a typical pipeline. The final column, `quality`, is the value to be predicted (ranging from 1 to 9). The `features` variable contains all the other column names.

### $\color{purple}{\text{Libraries for this lesson}}$

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from autoimpute.imputations import MiceImputer
from autoimpute.imputations import SingleImputer
from matplotlib.patches import Rectangle
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [2]:
training = pd.read_csv('data/original_wine_training.csv')
test = pd.read_csv('data/original_wine_test.csv')

In [3]:
training

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,type,quality
0,8.0,0.500,0.39,2.60,0.082,12.0,46.0,0.99850,3.43,0.62,10.7,red,6
1,6.6,0.280,0.28,8.50,0.052,55.0,211.0,0.99620,3.09,0.55,8.9,white,6
2,7.0,0.190,0.23,5.70,0.123,27.0,104.0,0.99540,3.04,0.54,9.4,white,6
3,7.4,0.200,0.37,16.95,0.048,43.0,190.0,0.99950,3.03,0.42,9.2,white,6
4,7.8,0.280,0.34,1.60,0.028,32.0,118.0,0.99010,3.00,0.38,12.1,white,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,9.8,0.300,0.39,1.70,0.062,3.0,9.0,0.99480,3.14,0.57,11.5,red,7
4996,8.3,0.845,0.01,2.20,0.070,5.0,14.0,0.99670,3.32,0.58,11.0,red,4
4997,7.1,0.360,0.28,2.40,0.036,35.0,115.0,0.98936,3.19,0.44,13.5,white,7
4998,6.6,0.240,0.27,15.80,0.035,46.0,188.0,0.99820,3.24,0.51,9.2,white,5


In [4]:
features = list(training.columns[0:-1])
features

['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'type']

### Typical ML Workflow

We build the pipeline by one-hot encoding the categorical `type` column, scaling all the features, and then applying a random forest regressor.

In [5]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ["type"])],remainder='passthrough'), 
    StandardScaler(),
    RandomForestRegressor())

In [6]:
pipeline.fit(training[features], training['quality'])

In [7]:
pipeline.score(test[features], test['quality'])

0.5401247706611145
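For a regressor, `score` returns the coefficient of determination $R^2$ rather than an accuracy, so the value above means the model explains roughly half of the variance in `quality` on the test set. A minimal sketch of that calculation follows; the two arrays are made-up numbers for illustration, not values from the wine data.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only
y_true = np.array([6.0, 6.0, 5.0, 7.0])
y_pred = np.array([5.5, 6.0, 5.0, 6.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot                      # coefficient of determination
print(r2)
```

A score of 1 would be a perfect fit, and 0 is what a model that always predicts the mean would get.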

### ML Workflow with Imputer

Note: This is meant to demonstrate the workflow; `autoimpute` is used only as an example.

One drawback is that `autoimpute` imputers require a `pandas` `DataFrame` as input, so a custom transformer is needed to convert the array produced by the previous step back into a `DataFrame`.


In [8]:
pandas_hack = FunctionTransformer(
    lambda x: pd.DataFrame(x, columns=['type_r', 'type_w'] + features[0:-1]))
pandas_hack_full = FunctionTransformer(
    lambda x: pd.DataFrame(x, columns=['type_r', 'type_w'] + features))
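The column names in the hack rely on the order in which `ColumnTransformer` emits its output: the transformed columns come first and the passed-through columns follow in their original order. The sketch below checks this on a tiny made-up frame (the values and the `toy` name are illustrative assumptions, not part of the wine data).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Tiny hypothetical frame with the categorical column last, as in the wine data
toy = pd.DataFrame({
    "pH": [3.4, 3.1, 3.0],
    "alcohol": [10.7, 8.9, 9.4],
    "type": ["red", "white", "white"],
})

encoder = ColumnTransformer([("type", OneHotEncoder(), ["type"])],
                            remainder="passthrough")
encoded = encoder.fit_transform(toy)
if hasattr(encoded, "toarray"):  # guard in case a sparse matrix is returned
    encoded = encoded.toarray()

# One-hot columns first (categories sorted: red, white), then the remainder
columns = ["type_r", "type_w"] + [c for c in toy.columns if c != "type"]
toy_df = pd.DataFrame(encoded, columns=columns)
print(toy_df)
```

If the categorical column were not the last one, or had more than two categories, the hard-coded names would no longer line up, which is why this is only a workaround.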

We can now insert the imputer into the pipeline:

In [9]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ['type'])],remainder='passthrough'), 
    pandas_hack,
    SingleImputer(strategy='least squares'), 
    StandardScaler(),
    RandomForestRegressor())

In [10]:
wine_training = pd.read_csv('data/wine_training.csv')
pipeline.fit(wine_training[features], wine_training['quality'])

In [11]:
wine_test = pd.read_csv('data/wine_test.csv')
pipeline.score(wine_test[features], wine_test['quality'])

0.46757083737038485

In [12]:
wine_future = pd.read_csv('data/wine_future.csv')
pipeline.predict(wine_future[features].iloc[0:10])

array([6.61, 6.49, 5.14, 4.73, 5.64, 5.08, 4.73, 5.34, 5.52, 6.32])

## $\color{purple}{\text{How does Multiple Imputation fit in?}}$
### $\color{purple}{\text{Approach 1: Augment Data with Multiple Copies}}$

![](https://raw.githubusercontent.com/WestHealth/pydataglobal-2022/main/images/stack.svg)

Augmentation teaches the model that the imputed values are "fuzzy" by providing different values for them.

We create the same preprocessing pipeline, except that it now ends with a `MiceImputer` instead of a model.
The resulting `dfs` is a list of 5 copies of our dataframe, each with a separate imputation.

In [13]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ['type'])],remainder='passthrough'), 
    StandardScaler(), 
    pandas_hack,
    MiceImputer(k=5, strategy='stochastic'))

dfs = [each[1] for each in pipeline.fit_transform(wine_training[features])]
print('Number of dataframe: ' + str(len(dfs)))

dfs[0]

Number of dataframe: 5


Unnamed: 0,type_r,type_w,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,1.753562,-1.753562,0.596969,0.989712,0.494524,-0.601330,0.710839,-1.061784,-1.227729,1.253843,1.311354,0.587923,0.178056
1,-0.570268,0.570268,-0.468202,0.008971,-0.276098,0.631424,-0.111443,1.399135,1.687875,0.494073,-0.801990,0.120493,-1.335505
2,-0.570268,0.570268,-0.163867,-0.901984,-0.626381,0.046388,1.834623,-0.203323,-0.202850,0.229805,0.289512,0.053718,-0.915071
3,-0.570268,0.570268,0.140467,-0.840961,0.354411,2.396979,-0.221080,0.712367,1.316798,1.584178,-1.174933,-0.747590,-1.083245
4,-0.570268,0.570268,0.444802,-0.352782,0.144241,-0.810271,-0.769268,0.082830,0.044534,-1.520970,-1.361404,-1.014692,1.355271
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,1.753562,-1.753562,1.966474,-0.230737,0.494524,-0.789377,0.162651,-1.576860,-1.881532,0.031604,-0.491204,0.254044,0.850750
4996,1.753562,-1.753562,0.825220,3.094987,-2.167626,-0.684907,0.381926,-1.462398,-1.793180,0.659240,0.627625,0.320820,0.430316
4997,-0.570268,0.570268,-0.087784,0.135398,-0.276098,-0.643118,-0.549993,0.254522,-0.008477,-1.765417,-0.180418,0.588223,2.532485
4998,-0.570268,0.570268,-0.468202,-0.596872,-0.346155,1.150729,-0.577402,0.884059,0.723583,1.154742,1.431114,-0.146609,-1.083245
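In this pipeline the scaler runs before the imputer, so it sees the missing values. That works because `sklearn`'s `StandardScaler` disregards NaNs when fitting and keeps them as NaNs when transforming. A small sketch on a made-up array (the numbers are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing value in the first column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.mean_)  # column means computed from the observed values only
print(X_scaled)      # the missing entry is still NaN after scaling
```

The imputer that follows therefore works on standardized values, and the missing entries are still there for it to fill in.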


We augment the training set by concatenating the 5 different data frames. Alternatively, we could rotate through the imputations, using a different one for each training epoch.

In [14]:
augmented_training = pd.concat(dfs)

quality = pd.concat([
    wine_training.quality, wine_training.quality, wine_training.quality,
    wine_training.quality, wine_training.quality
])
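The alternative of rotating imputations across epochs can be sketched as follows. This is a toy illustration under stated assumptions: the data is synthetic, the "imputations" are random draws from the observed values' distribution (a stand-in, not MICE), and the model is a plain linear regression trained by gradient descent rather than the models used in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5

# Synthetic complete data; column 0 will have missing entries
X_full = rng.normal(size=(n, 2))
y = X_full @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=n)
missing = rng.random(n) < 0.2

# k stand-in stochastic imputations of column 0
observed = X_full[~missing, 0]
imputations = []
for _ in range(k):
    X_imp = X_full.copy()
    X_imp[missing, 0] = rng.normal(observed.mean(), observed.std(),
                                   size=missing.sum())
    imputations.append(X_imp)

# Each epoch trains on a different imputed copy, cycling through all k
w = np.zeros(2)
for epoch in range(50):
    X = imputations[epoch % k]
    grad = 2 * X.T @ (X @ w - y) / n  # gradient of the mean squared error
    w -= 0.1 * grad

print(w)
```

Compared with concatenation, rotation keeps each epoch the size of the original training set while still exposing the model to all of the imputations over time.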

In [15]:
augmented_training

Unnamed: 0,type_r,type_w,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,1.753562,-1.753562,0.596969,0.989712,0.494524,-0.601330,0.710839,-1.061784,-1.227729,1.253843,1.311354,0.587923,0.178056
1,-0.570268,0.570268,-0.468202,0.008971,-0.276098,0.631424,-0.111443,1.399135,1.687875,0.494073,-0.801990,0.120493,-1.335505
2,-0.570268,0.570268,-0.163867,-0.901984,-0.626381,0.046388,1.834623,-0.203323,-0.202850,0.229805,0.289512,0.053718,-0.915071
3,-0.570268,0.570268,0.140467,-0.840961,0.354411,2.396979,-0.221080,0.712367,1.316798,1.584178,-1.174933,-0.747590,-1.083245
4,-0.570268,0.570268,0.444802,-0.352782,0.144241,-0.810271,-0.769268,0.082830,0.044534,-1.520970,-1.361404,-1.014692,1.355271
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,1.753562,-1.753562,1.966474,-0.230737,0.494524,-0.789377,0.162651,-1.576860,-1.881532,0.031604,-0.491204,0.254044,0.850750
4996,1.753562,-1.753562,0.825220,3.094987,-2.167626,-0.684907,0.381926,-1.462398,-1.793180,0.659240,0.627625,0.320820,0.430316
4997,-0.570268,0.570268,-0.087784,0.135398,-0.276098,-0.643118,-0.549993,0.254522,-0.008477,-1.765417,-0.180418,-0.563210,2.532485
4998,-0.570268,0.570268,-0.468202,-0.596872,-0.346155,1.326150,-0.577402,0.884059,1.291953,1.154742,0.894511,-0.146609,-1.083245


Since the test set is only used to evaluate the model (for example, to detect when it starts to overfit), it is not necessary to multiply impute the test set, but you may.

In [16]:
test_dfs = [each[1] for each in pipeline.transform(wine_test[features])]

test1 = test_dfs[0]  # Variation one, just take one imputation
quality1 = wine_test.quality

test2 = pd.concat(test_dfs)  # Variation two augment in the same way
quality2 = pd.concat([quality1, quality1, quality1, quality1, quality1])

In [17]:
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor())

In [18]:
pipeline.fit(augmented_training, quality)

In [19]:
pipeline.score(test1, quality1)

0.4577084922385537

In [20]:
pipeline.score(test2, quality2)

0.46220538972149505

Here we show an additional example that uses a neural network to make predictions from the imputed dataset. The network is built as a classification model. Bear in mind that this is just for demonstration purposes; our model is not a particularly good estimator.

In [21]:
# Neural Network Approach
model = tf.keras.models.Sequential([
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(20, activation=tf.nn.tanh),
    tf.keras.layers.Dense(20, activation=tf.nn.tanh),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])



In [22]:
model.fit(augmented_training, quality, epochs=10, verbose=False)

<keras.callbacks.History at 0x7f44d82e6760>

In [23]:
model.evaluate(test1, quality1)



[1.0315864086151123, 0.5329999923706055]

In [24]:
model.evaluate(test2, quality2)



[1.0339940786361694, 0.5332000255584717]
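The two numbers returned by `evaluate` are the loss followed by the compiled metric, here sparse categorical cross-entropy and accuracy. "Sparse" means the labels are integer class indices rather than one-hot vectors, which is why `quality` can be passed in directly. A minimal NumPy sketch of both quantities on made-up probabilities (three classes only, for brevity):

```python
import numpy as np

# Hypothetical softmax outputs for 3 samples and 3 classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 0])  # integer class indices, not one-hot vectors

# Cross-entropy: mean negative log-probability assigned to the true class
true_class_probs = probs[np.arange(len(labels)), labels]
loss = -np.mean(np.log(true_class_probs))

# Accuracy: fraction of samples whose most likely class is the true class
accuracy = np.mean(probs.argmax(axis=1) == labels)

print([loss, accuracy])
```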

#### What about future values?
The same options apply:
 * take one imputation
 * run all imputations through the model and use an ensemble technique to combine (e.g., majority voting)

In [25]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ['type'])],remainder='passthrough'), 
    StandardScaler(), 
    pandas_hack,
    MiceImputer(k=5, strategy='stochastic'))

future_dfs = [
    each[1].iloc[[0]] for each in pipeline.fit_transform(wine_future[features])
]

future1 = future_dfs[0]  # Variation one, just take one imputation
future2 = pd.concat(future_dfs)  # Variation two augment in the same way

In [26]:
model.predict(future1)

array([[4.9242957e-05, 3.3287535e-05, 2.4239307e-05, 9.2014414e-04,
        4.7006989e-03, 1.5112912e-02, 2.2790848e-01, 5.7404137e-01,
        1.7503612e-01, 2.1735001e-03]], dtype=float32)

In [27]:
# Variation 1: Pick First Imputed
# The model outputs one probability per quality class (0-9);
# argmax returns the index of the most likely class, so 7 means quality 7

np.argmax(model.predict(future1))

7

Note that here we simply sum the predicted probabilities over the 5 imputations, which for the purpose of `argmax` is the same as averaging them. In practice, you can use other aggregation methods, such as majority voting.

In [28]:
# Variation 2: Aggregate all Imputed
np.argmax(model.predict(future2).sum(axis=0))

7
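Summing probabilities and majority voting do not always agree. The sketch below contrasts the two on made-up probabilities (five imputations, three classes); the numbers are chosen purely to show a case where the results differ.

```python
import numpy as np

# Hypothetical class probabilities, one row per imputation
probs = np.array([[0.45, 0.55, 0.0],
                  [0.45, 0.55, 0.0],
                  [0.45, 0.55, 0.0],
                  [0.95, 0.05, 0.0],
                  [0.95, 0.05, 0.0]])

# Sum (or average) the probabilities, then pick the best class
soft_vote = int(np.argmax(probs.sum(axis=0)))

# Majority vote: each imputation votes for its own most likely class
votes = probs.argmax(axis=1)
hard_vote = int(np.bincount(votes, minlength=probs.shape[1]).argmax())

print(soft_vote, hard_vote)
```

Summing lets two very confident predictions outweigh three marginal ones, while majority voting counts every imputation equally.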

### $\color{purple}{\text{Approach 2: Combine Multiple Models}}$

![](https://raw.githubusercontent.com/WestHealth/pydataglobal-2022/main/images/ensemble.svg)

In [29]:
models = [
    tf.keras.models.Sequential([
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(20, activation=tf.nn.tanh),
        tf.keras.layers.Dense(20, activation=tf.nn.tanh),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ]) for _ in range(0, 5)
]

for model in models:
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

Each model is trained on a different imputed dataset and is then tested on the corresponding imputation of the test set.

In [30]:
# Reminder: dfs stores a list of 5 dataframes that were imputed independently

for model, training in zip(models, dfs):
    model.fit(training, wine_training.quality, epochs=10, verbose=False)

In [31]:
for model, test in zip(models, test_dfs):
    print(model.evaluate(test, wine_test.quality, verbose=False))

[1.0760548114776611, 0.5329999923706055]
[1.075652003288269, 0.5199999809265137]
[1.0803433656692505, 0.5040000081062317]
[1.063344955444336, 0.5260000228881836]
[1.0697463750839233, 0.5429999828338623]


In [32]:
# np.vstack stacks the per-model prediction arrays into one 2-D array (one row per model)
predictions = np.vstack(
    [model.predict(future) for model, future in zip(models, future_dfs)])



In [33]:
# Aggregate predictions
np.argmax(predictions.sum(axis=0))

7

### $\color{purple}{\text{Worth Mentioning: Use Single Imputation with Bagging}}$
Bagging is short for bootstrap aggregating.

Rather than multiple imputation, a single imputation is performed on each of several resampled (bootstrapped) datasets.

![](https://raw.githubusercontent.com/WestHealth/pydataglobal-2022/main/images/single.svg)
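The resampling step can be sketched on its own as follows. This is an illustration under stated assumptions: the frame is synthetic, and a simple mean fill stands in for the single imputer (it is not the `MiceImputer` used in this notebook). The point is that each bootstrap sample keeps its own labels, because resampling changes which rows are present.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic frame with a few missing values in one feature
toy = pd.DataFrame({
    "x1": rng.normal(size=20),
    "x2": rng.normal(size=20),
    "quality": rng.integers(3, 9, size=20),
})
toy.loc[[2, 7, 11], "x1"] = np.nan

bagged = []
for seed in range(5):
    # Bootstrap: draw rows with replacement, same size as the original
    sample = toy.sample(frac=1, replace=True,
                        random_state=seed).reset_index(drop=True)
    X = sample[["x1", "x2"]]
    X = X.fillna(X.mean())  # stand-in single imputation on this sample
    y = sample["quality"]   # labels of the resampled rows
    bagged.append((X, y))

print(len(bagged), bagged[0][0].shape)
```

Each `(X, y)` pair would then train its own model, and the models' predictions are aggregated as in Approach 2.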

In [34]:
pipeline = make_pipeline(
    ColumnTransformer([("type", OneHotEncoder(), ['type'])],remainder='passthrough'), 
    StandardScaler(),
    pandas_hack_full, 
    MiceImputer(k=1, strategy='stochastic')
)

bagged_dfs = [
    next(pipeline.fit_transform(wine_training.sample(frac=1, replace=True)))[1]
    for _ in range(0, 5)
]

In [35]:
models = [
    tf.keras.models.Sequential([
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(20, activation=tf.nn.tanh),
        tf.keras.layers.Dense(20, activation=tf.nn.tanh),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ]) for _ in range(0, 5)
]

for model in models:
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

In [36]:
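# NOTE: as written, this loop trains on `dfs` (the imputations from Approach 1),
# not on `bagged_dfs`. Training on the bootstrap samples requires pairing each
# frame in `bagged_dfs` with the labels of its resampled rows.
for model, training in zip(models, dfs):
    model.fit(training, wine_training.quality, epochs=10, verbose=False)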

In [37]:
predictions = np.vstack(
    [model.predict(future) for model, future in zip(models, future_dfs)])
# Aggregate predictions
np.argmax(predictions.sum(axis=0))

7

## $\color{purple}{\text{Conclusion}}$

* In a typical machine learning pipeline, imputation should be done before the data is fed into the machine learning model

* Several strategies for using multiple imputation:
  * Augment the training set with multiple imputed copies and train a single model on it
  * Train an independent model on each imputed dataset and combine their predictions
  * Apply bagging: perform a single imputation on each of several bootstrap samples
