### Author: Ally Sprik
### Last-updated: 25-02-2024

The goal of this notebook is to explore data imputation with the MIDAS algorithm, a deep learning autoencoder that can handle both continuous and categorical data and can generate multiple imputations for missing values.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import MIDASpy as midas

pd.options.mode.copy_on_write = True  # Opt in to copy-on-write: fewer defensive copies and no SettingWithCopyWarning. Details: https://pandas.pydata.org/pandas-docs/stable/user_guide/copy_on_write.html

df = pd.read_csv('../../0. Source_files/0.2. Cleaned_data/Training_JAMA+Brno_model_cleaned.csv', sep=',')
extra_cols = df[["Study_number", "Included_in_training_cohort", "Comorbidity_index", "Platelets_numeric", "CA125_PREOP", "Age", "BMI"]].copy()
df = df.drop(["Study_number", "Included_in_training_cohort", "Comorbidity_index", "Platelets_numeric", "CA125_PREOP", "Age", "BMI"], axis=1)

# Normalize every missing-value marker (None, pd.NA, ...) to np.nan in one vectorized step
df = df.mask(df.isna(), np.nan)

encoded, cat_cols_list = midas.cat_conv(df)

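import os

# Hide all GPUs from TensorFlow. Assigning a plain Python variable has no effect;
# the value must be set in the process environment, ideally before MIDASpy/TensorFlow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = ""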
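
For orientation, the encoding step can be pictured as one-hot encoding in which a missing value stays missing in every dummy column of its variable, so the model can tell which cells need imputing. The sketch below is a rough pandas stand-in on toy data, not MIDASpy's implementation; `one_hot_keep_missing`, the toy values, and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd


def one_hot_keep_missing(frame):
    """One-hot encode every column; rows missing in a column stay NaN in its dummies."""
    encoded_parts = []
    groups = []
    for col in frame.columns:
        dummies = pd.get_dummies(frame[col], prefix=col, dtype=float)
        # get_dummies yields all-zero rows for missing values; mark them as NaN instead
        dummies.loc[frame[col].isna(), :] = np.nan
        encoded_parts.append(dummies)
        groups.append(list(dummies.columns))
    return pd.concat(encoded_parts, axis=1), groups


toy = pd.DataFrame({"Grade": ["low", "high", np.nan], "Stage": ["I", np.nan, "II"]})
toy_encoded, toy_groups = one_hot_keep_missing(toy)
print(toy_groups)
```

The second return value plays the same role as `cat_cols_list` in the notebook: one list of dummy column names per original variable.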

Set up and build the imputation model

The imputation model is a deep learning autoencoder that can handle both continuous and categorical data and can generate multiple imputations for missing values.

Parameters:
- layer_structure: list of integers, the number of nodes in each layer of the autoencoder
- vae_layer: boolean, whether to include a variational autoencoder layer
- seed: int, random seed
- input_drop: float, the dropout rate for the input layer
- training_epochs: int, the number of epochs to train the model
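
To make the idea of input dropout in a denoising autoencoder more concrete, the sketch below randomly masks cells of a training batch with NumPy. It illustrates the general technique only and is not MIDASpy's code: `corrupt_inputs` and `keep_prob` are hypothetical names, the rescaling that dropout layers usually apply is left out, and whether `input_drop` denotes the kept or the dropped fraction should be checked against the MIDASpy documentation.

```python
import numpy as np


def corrupt_inputs(batch, keep_prob, rng):
    """Zero out each input cell independently with probability 1 - keep_prob."""
    keep_mask = rng.random(batch.shape) < keep_prob
    return batch * keep_mask, keep_mask


rng = np.random.default_rng(123)
batch = np.ones((1000, 8))  # toy batch: 1000 rows, 8 encoded input columns
corrupted, keep_mask = corrupt_inputs(batch, keep_prob=0.75, rng=rng)
print(f"fraction of inputs kept: {keep_mask.mean():.2f}")
```

The network is then trained to reconstruct the uncorrupted batch from the corrupted one, which is what lets it predict cells it cannot see.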

In [None]:
imputer = midas.Midas(layer_structure=[256,256], vae_layer=True, seed=123, input_drop=0.75)
imputer.build_model(encoded)
imputer.train_model(training_epochs=15)

Impute the data

In [None]:
imputations = imputer.generate_samples(m=10).output_list

Reapply the categorical labels

Pseudocode:
- Flatten the list of categorical columns
- Create a list of the column names
- for each imputation:
    - for each categorical column:
        - create a new column with the index of the maximum value
        - drop the original categorical columns
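
The steps above can be tried on a toy imputation first. This is a minimal sketch with made-up values; `toy_imputation`, `toy_groups`, and `toy_names` are hypothetical stand-ins for one element of `imputations`, for `cat_cols_list`, and for the original column names.

```python
import pandas as pd

# Toy imputation output: one block of one-hot scores per original variable
toy_imputation = pd.DataFrame({
    "Grade_high": [0.9, 0.2],
    "Grade_low": [0.1, 0.8],
    "Stage_I": [0.3, 0.6],
    "Stage_II": [0.7, 0.4],
})
toy_groups = [["Grade_high", "Grade_low"], ["Stage_I", "Stage_II"]]
toy_names = ["Grade", "Stage"]

# For each variable, keep the name of the dummy column with the highest value
labels = pd.DataFrame({
    name: toy_imputation[group].idxmax(axis=1)
    for name, group in zip(toy_names, toy_groups)
})

# Append the label columns and drop the dummy columns
flat = [col for group in toy_groups for col in group]
collapsed = pd.concat([toy_imputation, labels], axis=1).drop(columns=flat)
print(collapsed)
```

At this point the values still carry the variable name as a prefix (for example `Grade_high`), which is why a separate renaming step is needed afterwards.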

In [None]:
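# All dummy (one-hot) column names, flattened across variables
flat_cats = [cat for variable in cat_cols_list for cat in variable]
# Original variable names, in the same order as cat_cols_list
categorical = df.columns.values

for i in range(len(imputations)):
    # For each variable, pick the dummy column with the highest predicted value
    tmp_cat = [imputations[i][cols].idxmax(axis=1) for cols in cat_cols_list]
    cat_df = pd.DataFrame(dict(zip(categorical, tmp_cat)))
    # Append the label columns and drop the dummy columns
    imputations[i] = pd.concat([imputations[i], cat_df], axis=1).drop(flat_cats, axis=1)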

Reapply the column names

Pseudocode:
- for each imputation:
    - for each column:
        - remove the column name prefix
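
The prefix removal can be done one column at a time rather than cell by cell. The sketch below uses toy data and assumes pandas 1.4 or newer, where `Series.str.removeprefix` is available; `toy_labels` is a hypothetical stand-in for one imputed DataFrame.

```python
import pandas as pd

toy_labels = pd.DataFrame({
    "Grade": ["Grade_high", "Grade_low"],
    "Stage": ["Stage_II", "Stage_I"],
})

# Strip "<column>_" from every value, one whole column at a time
for col in toy_labels.columns:
    toy_labels[col] = toy_labels[col].str.removeprefix(col + "_")
print(toy_labels)
```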

In [None]:
for imputation in imputations:
    for col in imputation.columns.values:
        # Vectorized prefix removal; updates the DataFrame stored in the list
        imputation[col] = imputation[col].str.removeprefix(col + '_')

Save the last imputation

Pseudocode:
- for each extra column:
    - for each row:
        - add the value back into the result dataframe - This keeps the extra columns in the final dataframe even though they were not imputed
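
Because the imputed frame and the set-aside columns share the same row index, the values can also be reattached in one step with `pd.concat`. A minimal sketch on made-up values follows; `toy_imputed` and `toy_extra` are hypothetical stand-ins for the last imputation and for `extra_cols`.

```python
import pandas as pd

toy_imputed = pd.DataFrame({"Grade": ["high", "low"], "Stage": ["II", "I"]})
toy_extra = pd.DataFrame({"Study_number": [101, 102], "Age": [63, 58]})

# Both frames must share the same row index for the alignment to be correct
assert toy_imputed.index.equals(toy_extra.index)
toy_result = pd.concat([toy_imputed, toy_extra], axis=1)
print(toy_result.columns.tolist())
```

If the two indexes differ, for instance after filtering rows, `pd.concat` aligns on the index labels and fills unmatched rows with NaN, so the index check is worth keeping.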

In [None]:
# Take the last generated imputation and reattach the non-imputed columns, aligned on the row index
result = pd.concat([imputations[-1], extra_cols], axis=1)

result.to_csv('../../0. Source_files/0.3. Imputed_data/MIDAS_Imputed_TCGATraining_JAMA_Brno.csv', index=False)