### Author: Ally Sprik
### Last-updated: 25-02-2024

The goal of this notebook is to use MIDASpy to impute the missing values in the MAYO dataset, with the PIPENDO dataset added as extra training data. The imputed dataset is used in the next steps of the analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import MIDASpy as midas
import tensorflow as tf

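# Hide all GPUs from TensorFlow so that training runs on the CPU
tf.config.set_visible_devices([], 'GPU')
# Confirm that no GPU is visible (list_physical_devices would still report the hidden GPUs)
print("Num GPUs Visible: ", len(tf.config.get_visible_devices('GPU')))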

pd.options.mode.copy_on_write = True  # Enable copy-on-write: avoids unnecessary copies and chained-assignment warnings. Details: https://pandas.pydata.org/pandas-docs/stable/user_guide/copy_on_write.html#


In [None]:
df_MAYO = pd.read_csv("../../0. Source_files/0.2. Cleaned_data/MAYO_cleaned_model.csv")
df_PIP = pd.read_csv('../../0. Source_files/0.2. Cleaned_data/Casper_PIPENDO_Cleaned.csv', index_col='Unnamed: 0')

Select the columns to use as imputation evidence. Both datasets contain these columns, so the same list is applied to each.

In [None]:
evidence_columns = ["ER", "PR", "p53", "L1CAM", "CA125", "Platelets", "PreoperativeGrade","LNM", "LVSI", "Chemotherapy", "Radiotherapy", "Survival1yr", "Survival3yr", "Survival5yr", "Cytology"]


df_MAYO = df_MAYO[evidence_columns]
df_PIP = df_PIP[evidence_columns]

# Concatenate the two datasets (MAYO rows first, then PIPENDO) and recode 0/1 as 'no'/'yes'
data = pd.concat([df_MAYO, df_PIP], axis=0, ignore_index=True).replace({0:'no', 1:'yes'})
data
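One thing to keep in mind about the `replace` call above: it is applied to every column, so a numeric measurement that happens to equal exactly 0 or 1 is recoded as well. The sketch below uses a small made-up frame (the values and the `binary_cols` list are hypothetical, not taken from the MAYO or PIPENDO data) to show the effect, together with one possible way of restricting the recoding to binary columns.

```python
import pandas as pd

# Hypothetical toy data: one binary marker and one numeric measurement
toy_a = pd.DataFrame({"LNM": [0, 1], "CA125": [1.0, 35.0]})
toy_b = pd.DataFrame({"LNM": [1, None], "CA125": [None, 12.0]})

# ignore_index=True gives the stacked frame a fresh 0..n-1 index
stacked = pd.concat([toy_a, toy_b], axis=0, ignore_index=True)

# Frame-wide replace: the CA125 value 1.0 also becomes 'yes'
recoded_all = stacked.replace({0: "no", 1: "yes"})

# Alternative (an assumption about the intent): recode only the binary columns
binary_cols = ["LNM"]
recoded_binary = stacked.copy()
recoded_binary[binary_cols] = recoded_binary[binary_cols].replace({0: "no", 1: "yes"})

print(recoded_all)
print(recoded_binary)
```

Whether the frame-wide recoding matters here depends on whether columns such as `CA125` or `Platelets` can take the exact values 0 or 1 in the cleaned data.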

Set the data to categorical

In [None]:
for column in data.columns:
    data[column] = data[column].astype('category')


Encode the data

In [None]:
encoded, cat_cols_list = midas.cat_conv(data)
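To make the encoding step concrete, the sketch below reproduces the general idea with plain pandas: each categorical column becomes a group of indicator columns named `<column>_<category>`, and a row with a missing value stays missing across its whole group. This only approximates what `midas.cat_conv` returns; the helper `one_hot_keep_missing` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def one_hot_keep_missing(frame):
    """One-hot encode every column, keeping missing rows as NaN (illustrative helper)."""
    blocks, names = [], []
    for column in frame.columns:
        missing = frame[column].isna()
        dummies = pd.get_dummies(frame[column], prefix=column).astype(float)
        # A missing value must stay missing, not turn into an "all zeros" row
        dummies.loc[missing, :] = np.nan
        blocks.append(dummies)
        names.append(list(dummies.columns))
    return pd.concat(blocks, axis=1), names

toy = pd.DataFrame({"ER": ["yes", "no", None], "LVSI": ["no", None, "yes"]}).astype("category")
toy_encoded, toy_groups = one_hot_keep_missing(toy)

print(toy_groups)
print(toy_encoded)
```

The second return value groups the indicator columns per original variable, which is the role `cat_cols_list` plays when it is passed as `softmax_columns` further down.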

Create the MIDAS model and train it

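Parameters:
- layer_structure: list of integers, the number of nodes in each layer of the autoencoder
- vae_layer: boolean, whether to include a variational autoencoder layer
- seed: int, random seed
- input_drop: float, the dropout rate for the input layer
- latent_space_size: int, the number of dimensions of the latent space in the variational layer
- vae_sample_var: float, the sampling variance of the latent distribution
- vae_alpha: float, the weight of the KL-divergence term in the loss
- training_epochs: int, the number of epochs to train the model (passed to `train_model`)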


In [None]:
imputer = midas.Midas(layer_structure=[256,256,256], vae_layer=True, seed=123, input_drop=0.90, latent_space_size=64, vae_sample_var=0.8, vae_alpha=1)

imputer.build_model(encoded, softmax_columns=cat_cols_list)
imputer.train_model(training_epochs=100)

Generate the imputations

In [None]:
imputations = imputer.generate_samples(m=10).output_list

Decode the imputations

Pseudocode:
- Flatten the nested list of one-hot column names
- Create a list of the categorical columns
- for each imputation:
    - for each variable, take the one-hot column with the highest imputed value
    - remove the one-hot columns
    - add the decoded categorical columns
- for each imputation:
    - for each column in the imputation:
        - remove the column-name prefix from every value
- Retrieve the last imputation as the completed data

In [None]:
flat_cats = [cat for variable in cat_cols_list for cat in variable]
categorical = data.columns.values

for i in range(len(imputations)):
    # For each variable, pick the one-hot column with the highest imputed value
    tmp_cat = [imputations[i][x].idxmax(axis=1) for x in cat_cols_list]
    cat_df = pd.DataFrame({categorical[k]: tmp_cat[k] for k in range(len(categorical))})
    # Replace the one-hot columns with the decoded categorical columns
    imputations[i] = pd.concat([imputations[i], cat_df], axis=1).drop(flat_cats, axis=1)

for i in range(len(imputations)):
    for col in imputations[i].columns:
        # Strip the "<column>_" prefix left over from the one-hot column names
        imputations[i][col] = imputations[i][col].str.removeprefix(col + '_')

# Use the last of the m imputations as the completed data
completed_data = imputations[-1]
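The decoding logic can be tried on a tiny hand-made example before running it on the real imputations. The frame `toy_probs` below is a hypothetical stand-in for one imputed sample (its columns hold per-category scores), and `toy_groups` plays the role of `cat_cols_list`, keyed by variable name for readability.

```python
import pandas as pd

# Hypothetical imputed scores for two variables with two categories each
toy_probs = pd.DataFrame({
    "ER_no":    [0.9, 0.2, 0.4],
    "ER_yes":   [0.1, 0.8, 0.6],
    "LVSI_no":  [0.3, 0.7, 0.5],
    "LVSI_yes": [0.7, 0.3, 0.5],
})
toy_groups = {"ER": ["ER_no", "ER_yes"], "LVSI": ["LVSI_no", "LVSI_yes"]}

decoded = pd.DataFrame({
    # idxmax picks the highest-scoring indicator column per row;
    # str.removeprefix then strips the "<variable>_" part of its name
    name: toy_probs[cols].idxmax(axis=1).str.removeprefix(name + "_")
    for name, cols in toy_groups.items()
})

print(decoded)
```

Note that `idxmax` resolves ties by taking the first column, so the last `LVSI` row (0.5 against 0.5) decodes to `no`. `Series.str.removeprefix` requires pandas 1.4 or newer.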


Split the imputed data back into the original datasets, then check that the imputed data still matches the original data wherever a value was observed.

Pseudocode:
- Define the mayo part as the rows up until the length of the original MAYO dataset
- Define the PIP part as the rows from the length of the original MAYO dataset to the end of the imputed data
- for each column in the evidence columns:
    - create a temporary dataframe with the column from the original MAYO dataset, missing values removed
    - get the index of the non-missing values
    - get the corresponding values from the MAYO part of the imputed data
    - compare if the two dataframes are the same

In [None]:
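MAYO_part = completed_data.iloc[:len(df_MAYO), :]
PIP_part = completed_data.iloc[len(df_MAYO):, :]

# Compare against the recoded input (`data`), whose first len(df_MAYO) rows are the MAYO rows.
# The decoded imputations are strings, so both sides are cast to str before comparing.
MAYO_observed = data.iloc[:len(df_MAYO), :]

for col in evidence_columns:
    temp = MAYO_observed[col].dropna().astype(str)
    # Select by index label (.loc) so the rows line up with the non-missing originals
    temppart = MAYO_part.loc[temp.index, col].astype(str)

    # Compare if it's the same
    if (temp == temppart).all():
        print(f"{col} is the same")
    else:
        print(f"{col} is not the same")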
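The same check can be wrapped in a small reusable function. This is a sketch on made-up data: `observed_values_preserved` is a hypothetical helper, and both frames are cast to strings so that values only count as equal when they print identically.

```python
import pandas as pd

def observed_values_preserved(original, completed):
    """Return {column: bool}; True when every non-missing original value is unchanged."""
    result = {}
    for col in original.columns:
        observed = original[col].dropna().astype(str)
        # .loc selects by index label, so original and completed rows line up
        matched = completed.loc[observed.index, col].astype(str)
        result[col] = bool((observed == matched).all())
    return result

toy_original = pd.DataFrame({"ER": ["yes", None, "no"], "LVSI": ["no", "yes", None]})
toy_completed = pd.DataFrame({"ER": ["yes", "yes", "no"], "LVSI": ["yes", "yes", "no"]})

print(observed_values_preserved(toy_original, toy_completed))
```

In the toy data the first `LVSI` value was observed as `no` but appears as `yes` in the completed frame, so that column is flagged while `ER` passes.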

Load the original data and add the imputed CA125 to it

Pseudocode:
- Load the original MAYO data
- Add the imputed CA125 to the original data
- For each column in the original data:
    - if the column is not in the imputed MAYO data:
        - add the column to the imputed MAYO data

In [None]:
MAYO_w_CA125 = pd.read_csv("../../0. Source_files/0.2. Cleaned_data/MAYO_subdag.csv")
MAYO_w_CA125['CA125'] = MAYO_part['CA125']

for col in MAYO_w_CA125.columns:
    if col not in MAYO_part.columns:
        MAYO_part[col] = MAYO_w_CA125[col]
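Column assignment between two frames aligns on the index, not on row position, so this step relies on the MAYO rows carrying the same index in both frames. A minimal sketch with hypothetical frames (the `Age` column and all values are made up for illustration):

```python
import pandas as pd

toy_imputed = pd.DataFrame({"CA125": ["12.0", "35.0", "20.0"]}, index=[0, 1, 2])
toy_source = pd.DataFrame({"CA125": [None, 35.0, None], "Age": [61, 70, 55]}, index=[0, 1, 2])

# Copy over every column that the imputed frame does not have yet;
# existing columns (here CA125) keep their imputed values
for col in toy_source.columns:
    if col not in toy_imputed.columns:
        toy_imputed[col] = toy_source[col]

print(toy_imputed)
```

If the two indexes did not match, the copied column would contain missing values for the unmatched rows rather than raising an error, which is worth checking after the merge.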


Save the imputed data

In [None]:
MAYO_w_CA125.to_csv('../../0. Source_files/0.3. Imputed_data/MayoCA125_wPIP_MidasPy.csv', index=False)
MAYO_part.to_csv('../../0. Source_files/0.3. Imputed_data/Mayo_wPIP_fullimp_MidasPy.csv', index=False)