# $\color{purple}{\text{Missing Data in the Age of Machine Learning (Part 1)}}$

## $\color{purple}{\text{Regression in Imputation}}$

### $\color{purple}{\text{Libraries for this lesson}}$

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from autoimpute.imputations import SingleImputer
import matplotlib.pyplot as plt
import tensorflow as tf

In [None]:
df = pd.read_csv('data/full_set.csv')
mar_df = pd.read_csv('data/mar_set.csv')

## $\color{purple}{\text{Quick Look at the Data Set}}$

 * Full Set - Synthetic Normally Distributed Data Set
 * MAR Set - Data in the `feature a` column clobbered using an MAR mechanism
 * Double MAR Set - Data in the `feature a` and `feature b` columns clobbered using an MAR mechanism

### $\color{purple}{\text{Assess the missingness}}$

In [None]:
mar_df.isnull().sum()

### $\color{purple}{\text{Compare Statistics}}$
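
One way to compare statistics is to put the summary statistics of the full and the clobbered set side by side. The following is a minimal sketch on synthetic stand-in data, not the lesson's CSV files: the frames `full` and `mar`, the columns `a` and `b`, and the missingness rule are all assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for the full set: two correlated normal columns.
b = rng.normal(0.0, 1.0, 5000)
a = 0.8 * b + rng.normal(0.0, 0.6, 5000)
full = pd.DataFrame({'a': a, 'b': b})

# Hypothetical MAR mechanism: 'a' is more likely to be missing when 'b' is large.
missing = rng.random(5000) < 1.0 / (1.0 + np.exp(-2.0 * full['b']))
mar = full.assign(a=full['a'].mask(missing))

# Side-by-side mean and standard deviation of 'a' before and after clobbering.
comparison = pd.DataFrame({
    'full': full['a'].agg(['mean', 'std']),
    'mar (observed only)': mar['a'].agg(['mean', 'std']),
})
print(comparison)
```

In this toy setup the observed-only mean of `a` drifts below the full-data mean, which is the kind of distortion the imputation methods below try to repair.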

### $\color{purple}{\text{Conventional Imputation: Stochastic Linear Regression}}$

In [None]:
linear_regressor = LinearRegression()

#### Perform the linear regression

We base the prediction of `feature a` on the remaining features in `rest`. We only run the regression on data with full rows, `full_data`.

In [None]:
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = mar_df.dropna()
linear_regressor.fit(full_data[rest], full_data['feature a'])
predicted = linear_regressor.predict(mar_df[rest])

#### A note about a code pattern

I will be repeating the following code pattern, or a variation of it, throughout this lesson.

```.assign(**{'feature a': df['feature a'].where(~df['feature a'].isnull(), predicted)})```

It keeps every observed value and substitutes the predicted value only where the value is missing.

This is basically the same pattern as

```df['feature a'] = df['feature a'].where(~df['feature a'].isnull(), predicted)```

but it returns a new dataframe instead of modifying `df` in place, which allows for passing the dataframe along or method chaining.

In [None]:
imputed = mar_df.assign(
    **{
        'feature a':
        mar_df['feature a'].where(~mar_df['feature a'].isnull(), predicted)
    })
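
To make the substitution pattern concrete, here is a small sketch on a toy frame. The names `toy`, `toy_predicted` and `toy_imputed` and all the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two of the four values in 'feature a' are missing.
toy = pd.DataFrame({'feature a': [1.0, np.nan, 3.0, np.nan],
                    'feature b': [10.0, 20.0, 30.0, 40.0]})

# Hypothetical predictions, one per row (including rows that are not missing).
toy_predicted = np.array([1.5, 2.5, 3.5, 4.5])

# Keep observed values, substitute the prediction only where the value is missing.
toy_imputed = toy.assign(
    **{'feature a': toy['feature a'].where(~toy['feature a'].isnull(), toy_predicted)})

print(toy_imputed['feature a'].tolist())  # → [1.0, 2.5, 3.0, 4.5]
```

The `**{...}` unpacking is only needed because the column name contains a space and so cannot be written as a plain keyword argument to `assign`.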

### $\color{purple}{\text{Analyze the Results}}$

#### $\color{purple}{\text{Adding Stochastic Element to Linear Regression}}$
* Extends Linear Regression by adding noise modelling the residuals
* Better simulates variance

We build on the linear regression prediction above and calculate the mean and standard deviation of its residuals.

In [None]:
residual = mar_df['feature a'] - predicted
residual.mean()
residual.std()

For the prediction we model the residual noise as a normal distribution and adjust predictions accordingly.

In [None]:
# Draw one noise value per row instead of hard-coding the number of rows
residual_noise = np.random.normal(residual.mean(), residual.std(), len(predicted))
predicted += residual_noise

In [None]:
imputed = mar_df.assign(
    **{
        'feature a':
        mar_df['feature a'].where(~mar_df['feature a'].isnull(), predicted)
    })
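
To see why the stochastic element matters, the following sketch compares the variance after plain regression imputation with the variance after adding residual noise. It runs on synthetic data with a simplified missingness mechanism (values removed completely at random), and the names `data`, `reg`, `deterministic` and `stochastic` are assumptions for illustration, not part of the lesson's data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 5000

# Hypothetical data: y depends linearly on x plus unit-variance noise.
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 1.0, n)
data = pd.DataFrame({'x': x, 'y': y})

# Knock out roughly 40% of y (a simplification; the lesson uses MAR).
data.loc[rng.random(n) < 0.4, 'y'] = np.nan
observed = data.dropna()

reg = LinearRegression().fit(observed[['x']], observed['y'])
pred = reg.predict(data[['x']])

# Residual spread estimated from the observed rows only.
resid_std = (observed['y'] - reg.predict(observed[['x']])).std()

# Plain regression imputation versus regression plus normal residual noise.
deterministic = data['y'].where(~data['y'].isnull(), pred)
stochastic = data['y'].where(~data['y'].isnull(),
                             pred + rng.normal(0.0, resid_std, n))

full_var = pd.Series(y).var()
print(full_var, deterministic.var(), stochastic.var())
```

In this setup the plain regression imputation understates the variance of `y`, because every imputed point sits exactly on the regression line, while the stochastic version lands closer to the variance of the complete data.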

### $\color{purple}{\text{Analyze the Results}}$

### $\color{purple}{\text{Built into}}$ `autoimpute`

In [None]:
imputer = SingleImputer('least squares')
ls_imputations = imputer.fit_transform(mar_df)

In [None]:
from autoimpute.imputations import SingleImputer

imputer = SingleImputer('stochastic')
st_imputations = imputer.fit_transform(mar_df)

### $\color{purple}{\text{Analyze Results}}$

### $\color{purple}{\text{Random Forest Regression}}$
Let's use Random Forest Regression instead.

In [None]:
rf_regressor = RandomForestRegressor()
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = mar_df.dropna()
rf_regressor.fit(full_data[rest], full_data['feature a'])
predicted = rf_regressor.predict(mar_df[rest])

In [None]:
imputed = mar_df.assign(
    **{
        'feature a':
        mar_df['feature a'].where(~mar_df['feature a'].isnull(), predicted)
    })

### $\color{purple}{\text{Analyze Results}}$


## $\color{purple}{\text{Out on the Fringe: Let's try an Artificial Neural Network}}$

Imputation of categorical variables employs classification in place of regression. The most common choice is multinomial logistic regression.
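
As a rough sketch of what classification-based imputation looks like, the block below fills in a missing three-level category with `scikit-learn`'s `LogisticRegression`. This is a stand-in chosen for illustration, not the method used elsewhere in this lesson, and the data, the column names `x1` and `x2`, and the labels are all synthetic assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 600

# Hypothetical data: a three-level category driven by two numeric features.
X = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
score = X['x1'] + X['x2']
label = pd.cut(score, bins=[-np.inf, -0.5, 0.5, np.inf],
               labels=['low', 'mid', 'high']).astype(str)

# Knock out roughly 25% of the labels.
label_missing = label.mask(rng.random(n) < 0.25)
known = label_missing.notnull()

# With more than two classes and the default solver, recent scikit-learn
# versions fit a multinomial model (an assumption about the installed version).
clf = LogisticRegression(max_iter=1000)
clf.fit(X[known], label_missing[known])

# Same substitution pattern as before: fill only the missing labels.
label_imputed = label_missing.where(known, pd.Series(clf.predict(X), index=X.index))
```

The only change from the regression case is that the model predicts a class instead of a number; the `where` substitution stays the same.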

In [None]:
# Build the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(4, activation=tf.nn.tanh),
    tf.keras.layers.Dense(15, activation=tf.nn.tanh),
    tf.keras.layers.Dense(1)
])

In [None]:
# Compile the model
model.compile(optimizer='adam', loss='mse')

In [None]:
rest = ['feature b', 'feature c', 'feature d', 'uncorrelated']
full_data = mar_df.dropna()
history=model.fit(full_data[rest], full_data['feature a'], epochs=50, verbose=False)

In [None]:
plt.plot(history.history['loss'])

In [None]:
predicted = model.predict(mar_df[rest])

In [None]:
predicted

In [None]:
# model.predict returns an array of shape (n_rows, 1); take its single column
imputed = mar_df.assign(
    **{
        'feature a':
        mar_df['feature a'].where(~mar_df['feature a'].isnull(), predicted[:, 0])
    })

### $\color{purple}{\text{Analyze Results}}$

In [None]:
df.cov()

# $\color{purple}{\text{Missing Data in the Age of Machine Learning (Part 2)}}$

## $\color{purple}{\text{Neural Network Imputers}}$

### $\color{purple}{\text{Libraries for this lesson}}$

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from autoimpute.imputations import SingleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### $\color{purple}{\text{Denoising Autoencoders}}$

* The missing data (or the deviation from an imputed value) is treated as noise.
* Denoising autoencoders are neural networks trained to reproduce their input, so the same data serves as both input and target.
* The theory is that the trained network outputs the input with the noise removed.
* To work properly, the data should be normalized before the imputation and rescaled afterwards.

`scaler` uses `sklearn`'s `StandardScaler`
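
The scaler can be fitted directly on data that still contains missing values. A minimal sketch on a toy frame (the names `toy`, `toy_scaler`, `scaled` and `restored` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with one missing value per column.
toy = pd.DataFrame({'a': [1.0, 2.0, np.nan, 4.0],
                    'b': [10.0, np.nan, 30.0, 50.0]})

# NaNs are disregarded when fitting and kept as NaN when transforming.
toy_scaler = StandardScaler().fit(toy)
scaled = pd.DataFrame(toy_scaler.transform(toy), columns=toy.columns)
print(scaled.isnull().sum().tolist())

# The inverse transform recovers the original observed values.
restored = pd.DataFrame(toy_scaler.inverse_transform(scaled), columns=toy.columns)
```

Because the missing entries survive the scaling step, the scaled frame can later be used to decide which cells to fill with predictions.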


In [None]:
df = pd.read_csv('data/full_set.csv')
dmar_df = pd.read_csv('data/double_mar_set.csv')

scaler = StandardScaler()
scaler.fit(dmar_df)
sdmar_df = pd.DataFrame(scaler.transform(dmar_df), columns=dmar_df.columns)

In [None]:
dmar_df.isnull().sum()

In [None]:
def restore_df(scaler, x):
    """
    Invert the scaling and return the result as a dataframe
    """
    return pd.DataFrame(scaler.inverse_transform(x), columns=dmar_df.columns)

The basic autoencoder proposed by [Gondara and Wang](https://arxiv.org/abs/1705.02737)
![](https://raw.githubusercontent.com/WestHealth/pydataglobal-2022/main/images/dae.svg)
* Deep neural network with 5 hidden layers and a dropout layer
* $\Theta$ is a hyperparameter governing how much each layer expands or contracts
* $\Theta=7$ is the value suggested by best practice.
* Each of the first 3 hidden layers expands by $\Theta$ units, and each of the last 2 hidden layers contracts by $\Theta$ units.
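
To make the expansion and contraction concrete, here is a small sketch that lists the layer widths for a given number of features. The helper `dae_layer_widths` is hypothetical and simply follows the layer sizes used in the model code of this lesson (5 features, widths growing and shrinking in steps of $\Theta$).

```python
def dae_layer_widths(n_features, theta=7):
    """Widths of the five hidden layers plus the output layer (hypothetical helper)."""
    expanding = [n_features + k * theta for k in (1, 2, 3)]
    contracting = [n_features + k * theta for k in (2, 1)]
    return expanding + contracting + [n_features]


print(dae_layer_widths(5))  # → [12, 19, 26, 19, 12, 5]
```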

#### Step 1 Impute the data set using univariate imputation
The recommendation is mean or median imputation for numeric data and mode imputation for categorical data.

In [None]:
univariate_imputed = SingleImputer('median').fit_transform(sdmar_df)
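
For readers who want to see what univariate imputation does without the imputer class, here is a sketch in plain `pandas`. It is a stand-in for illustration only; the frame `toy` and its columns `height` and `colour` are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'height': [1.0, np.nan, 3.0, 100.0],       # numeric, with an outlier
    'colour': ['red', 'blue', None, 'red'],    # categorical
})

filled = toy.assign(
    # Median for numeric data: less sensitive to the outlier than the mean.
    height=toy['height'].fillna(toy['height'].median()),
    # Mode (most frequent value) for categorical data.
    colour=toy['colour'].fillna(toy['colour'].mode().iloc[0]),
)
print(filled)
```

Each column is filled using only its own observed values, which is why this step serves as a starting point rather than a final imputation.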

#### Step 2 Split data into training and test sets
This is only necessary if you are building a model that accepts future data (open configuration). If the data set is closed (i.e. you don't expect any new data), you can skip the split and train on the whole data set. Note that recent versions of `scikit-learn` raise an error for `test_size=0`.

In [None]:
theta = 7
# Divide into training and test sets
training, test = train_test_split(univariate_imputed, test_size=0.3)

#### Step 3 Build, Compile and Train a Deep Neural Network Model
* `theta` and the activation function are hyperparameters

See the `tensorflow` and `keras` documentation for further detail.

In [None]:
# Build the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5 + theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 2 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 3 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 2 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5)
])

In [None]:
# Compile the model
model.compile(optimizer='adam', loss='mse')

In [None]:
history = model.fit(training, training, epochs=50, verbose=False)

You can visualize the progress of the loss:

In [None]:
plt.plot(history.history['loss'])

#### Step 4 Make Predictions Based on the Initial Imputation
We replace the missing values with the predicted values and convert the result back to a `pandas` `DataFrame`.

In [None]:
predicted = pd.DataFrame(model.predict(univariate_imputed),
                         columns=dmar_df.columns)

In [None]:
# Don't forget to rescale the data after filling in missing data
imputed = restore_df(scaler, sdmar_df.combine_first(predicted))
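
The `combine_first` call is what keeps the observed values and takes predictions only for the gaps. A tiny sketch with made-up numbers (`observed`, `model_output` and `merged` are illustrative names):

```python
import numpy as np
import pandas as pd

observed = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})
model_output = pd.DataFrame({'a': [0.9, 2.2], 'b': [3.3, 4.1]})  # hypothetical predictions

# Values from `observed` win; NaNs are filled from `model_output`.
merged = observed.combine_first(model_output)
print(merged.values.tolist())  # → [[1.0, 3.3], [2.2, 4.0]]
```

Both frames are aligned on index and column labels, so the prediction frame needs the same labels as the frame with the missing values.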

### $\color{purple}{\text{Analyze the Results}}$

#### $\color{purple}{\text{Improved Feedback Denoising Autoencoders}}$

My own enhancement to the denoising autoencoder is described [here](https://arxiv.org/abs/2002.08338).

The algorithm was designed for closed data sets. This example shows one enhancement to the denoising autoencoder (DAE): iterative refinement of the imputed values. As with the DAE, **step 1** is univariate imputation.

In [None]:
univariate_imputation = SingleImputer('median').fit_transform(sdmar_df)

#### Step 2 Build and Compile Deep Neural Network Model
We use the same architecture as the DAE

In [None]:
# Build the model
model = tf.keras.models.Sequential([
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5 + theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 2 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 3 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + 2 * theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5 + theta, activation=tf.nn.tanh),
    tf.keras.layers.Dense(5)
])

In [None]:
# Compile the model
model.compile(optimizer='adam', loss='mse')

#### Step 3 Initial Fit
We use fewer epochs than for the standard DAE.

In [None]:
history = [
    model.fit(univariate_imputation,
              univariate_imputation,
              epochs=10,
              verbose=False)
]

#### Step 4 Iteration


In [None]:
predicted = pd.DataFrame(model.predict(univariate_imputation),
                         columns=dmar_df.columns)
iterated_imputation = sdmar_df.combine_first(predicted)
history.append(
    model.fit(iterated_imputation,
              iterated_imputation,
              epochs=2,
              verbose=False))

#### Repeat the Iteration a Prescribed Number of Times

In [None]:
for _ in range(0, 19):
    predicted = pd.DataFrame(model.predict(iterated_imputation),
                             columns=dmar_df.columns)
    iterated_imputation = sdmar_df.combine_first(predicted)
    history.append(
        model.fit(iterated_imputation,
                  iterated_imputation,
                  epochs=2,
                  verbose=False))
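
The predict, re-insert and refit cycle above can be wrapped in a small function. The sketch below is only meant to show the shape of the loop: `refine` is a hypothetical helper, and `ColumnMeanModel` is a trivial stand-in with `fit` and `predict` methods, not the Keras model from this lesson.

```python
import numpy as np
import pandas as pd


class ColumnMeanModel:
    """Hypothetical stand-in for the neural network: predicts each column's mean."""

    def fit(self, x, y, **kwargs):
        self.means_ = np.asarray(y).mean(axis=0)
        return self

    def predict(self, x):
        return np.tile(self.means_, (len(x), 1))


def refine(model, observed, initial, rounds):
    """Repeat predict -> re-insert observed values -> refit, `rounds` times."""
    current = initial
    for _ in range(rounds):
        predicted = pd.DataFrame(model.predict(current),
                                 columns=observed.columns,
                                 index=observed.index)
        # Observed values always win; predictions only fill the gaps.
        current = observed.combine_first(predicted)
        model.fit(current, current)
    return current


toy_observed = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 7.0]})
toy_initial = toy_observed.fillna(toy_observed.median())
toy_model = ColumnMeanModel().fit(toy_initial, toy_initial)
toy_result = refine(toy_model, toy_observed, toy_initial, rounds=5)
print(toy_result)
```

Whatever model is plugged in, the observed cells never change between rounds; only the imputed cells are updated.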

In [None]:
# Since we collected history in several batches, concatenate them so we can see a plot
losses = sum([each.history['loss'] for each in history], [])
plt.plot(losses)

#### Plug the final prediction into the missing values and rescale the result

In [None]:
predicted = pd.DataFrame(model.predict(iterated_imputation),
                         columns=dmar_df.columns)
imputed = restore_df(scaler, sdmar_df.combine_first(predicted))

### $\color{purple}{\text{Analyze the Results}}$

#### $\color{purple}{\text{Another Improvement}}$

* The loss function can be adjusted to eliminate the influence of missing/imputed values.
* Most NN packages are not equipped to handle this; it requires a complicated modification to the package, which is beyond the scope of this tutorial.
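
To illustrate just the idea of such a loss, here is a NumPy-only sketch of a mean squared error that ignores the imputed cells. It runs outside any training loop and does not attempt the package modification mentioned above; `masked_mse` and the example arrays are hypothetical.

```python
import numpy as np


def masked_mse(y_true, y_pred, observed_mask):
    """Mean squared error over observed entries only (illustrative helper)."""
    squared_error = (y_true - y_pred) ** 2
    return squared_error[observed_mask].mean()


y_true = np.array([[1.0, 2.0], [3.0, 0.0]])   # the 0.0 stands for an imputed placeholder
y_pred = np.array([[1.0, 4.0], [3.0, 10.0]])
mask = np.array([[True, True], [True, False]])  # False marks the imputed cell

masked = masked_mse(y_true, y_pred, mask)
plain = ((y_true - y_pred) ** 2).mean()
print(masked, plain)
```

In this example the single imputed cell dominates the plain loss (26.0), while the masked loss (about 1.33) reflects only the cells that were actually observed.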