### Author: Ally Sprik
### Last-updated: 25-02-2024

The goal of this notebook is to explore imputation with the MICE forest algorithm (the miceforest package), a random-forest-based imputation method.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import miceforest as mf
import random
import sklearn.neighbors._base 
import sys
pd.options.mode.copy_on_write = True  # This will allow the code to run faster and keep Pandas happy. Technical detail: https://pandas.pydata.org/pandas-docs/stable/user_guide/copy_on_write.html#

sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
from sklearn.impute import KNNImputer
import tensorflow as tf
import lightgbm as lgb

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')


In [None]:
# https://github.com/AnotherSamWilson/miceforest
# Read the data
data = pd.read_csv('../../0. Source_files/0.2. Cleaned_data/TrainTCGA_subdag.csv')
# Keep an untouched copy of the original data
or_data = data.copy()


Select the data to be imputed

In [None]:
cols = ["CT_or_MRI_LNM", "MRI_MI", "Platelets_bi", "CA125_PREOP_bi", "Grade_PREOP", "Cytology_bi",
        "p53_expression_preop", "TP53_mutation", "L1CAM_expression_preop", "ER_expression_preop", "PR_expression_preop",
        "F_POLE_mutation", "F_MSI_bi", "F_NSMP", "LNM_bi", "LNM_incl_followup_bi", "Grade", "MI_merged", "LVSI",
        "FIGO_surgical", "Chemotherapy", "Radiotherapy", "Recurrence", "one_year_survival", "three_year_survival", "five_year_survival"]

data = data[cols]
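
Before imputing, it can help to see how much is missing in each selected column. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper and the toy frame uses made-up values in place of the real data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values per column, most-missing first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })
    return summary.sort_values("frac_missing", ascending=False)

# Toy stand-in for the selected columns (values are invented for illustration).
toy = pd.DataFrame({
    "Grade": [1, 2, np.nan, 3],
    "LVSI": [0, np.nan, np.nan, 1],
    "Recurrence": [0, 1, 0, 0],
})
summary = missing_summary(toy)
print(summary)
```

In the notebook itself the same helper could be called as `missing_summary(data)` right after the column selection.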

Encode the data as categorical

Pseudocode:
- For each column in the data
    - Encode the column as categorical

In [None]:
# miceforest fits tree-based models (LightGBM), which handle pandas
# categorical columns directly, so no label encoding is needed.
# Make data categorical
for column in data.columns:
    data[column] = data[column].astype('category')
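
The conversion above keeps missing entries as missing; they do not become a category of their own. A small sketch on a toy frame (invented values, not the study data) illustrates this, using the vectorised `DataFrame.astype("category")` as an equivalent of the per-column loop.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both a numeric and a text column.
toy = pd.DataFrame({
    "Grade": [1, 2, np.nan, 3],
    "LVSI": ["yes", None, "no", "yes"],
})

# Vectorised equivalent of looping over the columns.
toy_cat = toy.astype("category")

# Missing counts are identical before and after the conversion.
n_missing_before = toy.isna().sum()
n_missing_after = toy_cat.isna().sum()

# Only observed values appear as categories.
grade_categories = list(toy_cat["Grade"].cat.categories)
lvsi_categories = list(toy_cat["LVSI"].cat.categories)
```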

Define the imputation model

In [None]:
kds = mf.ImputationKernel(
    data,
    save_all_iterations=True,
    random_state=123,
)


Impute the data

Depending on your configuration, you can run the imputation on the CPU or the GPU. The GPU is faster but not always available. Remove device="cuda" to use the CPU.

In [None]:

kds.mice(50, verbose=True, device="cuda")
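
Instead of editing the call by hand, the device choice can be kept in one flag. This is only a sketch: `mice_device_kwargs` and `USE_GPU` are hypothetical names, and it assumes, as the call above suggests, that extra keyword arguments to `mice` are passed through to LightGBM.

```python
def mice_device_kwargs(use_gpu: bool) -> dict:
    """Hypothetical helper: extra keyword arguments for kds.mice()."""
    # With no "device" entry the default (CPU) is used.
    return {"device": "cuda"} if use_gpu else {}

# Assumption: set to True only where a CUDA-enabled LightGBM build is installed.
USE_GPU = False

# In the notebook this would replace the call above:
# kds.mice(50, verbose=True, **mice_device_kwargs(USE_GPU))
```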

Get the completed data. If no arguments are given, the last iteration is used.

In [None]:
completed_data = kds.complete_data()
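
A quick sanity check after imputation is to confirm that no gaps remain and that values which were already observed have not changed. The sketch below is an illustration under assumptions: `check_imputation` is a hypothetical helper, it expects both frames to share the same index and columns, and the toy frames contain invented values.

```python
import numpy as np
import pandas as pd

def check_imputation(original: pd.DataFrame, completed: pd.DataFrame) -> None:
    """Raise AssertionError if gaps remain or observed values were altered."""
    assert not completed.isna().any().any(), "missing values remain"
    observed = original.notna()
    for col in original.columns:
        # Compare only the cells that were observed before imputation.
        before = original.loc[observed[col], col].to_numpy()
        after = completed.loc[observed[col], col].to_numpy()
        assert np.array_equal(before, after), f"observed values changed in {col}"

# Toy frames standing in for the data before and after imputation.
original = pd.DataFrame({"Grade": [1.0, np.nan, 3.0], "LVSI": [np.nan, 0.0, 1.0]})
completed = pd.DataFrame({"Grade": [1.0, 2.0, 3.0], "LVSI": [0.0, 0.0, 1.0]})

check_imputation(original, completed)  # passes silently
```

In the notebook the equivalent call would be `check_imputation(data, completed_data)`.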

Save the data

In [None]:
# Save the data
completed_data.to_csv("../../0. Source_files/0.3. Imputed_data/MCF-imputation-trainingdata.csv")
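
Note that `to_csv` as called above also writes the row index, and a CSV file does not store the categorical dtype. The following sketch shows one way to read such a file back; it uses an in-memory buffer and a toy frame with invented values instead of the real file.

```python
import io
import pandas as pd

# Toy categorical frame standing in for the imputed data.
toy = pd.DataFrame({"Grade": [1, 2, 3], "LVSI": [0, 1, 1]}).astype("category")

buffer = io.StringIO()
toy.to_csv(buffer)   # same call style as above: the index is written as the first column
buffer.seek(0)

# Treat the first column as the index again when reading back.
reloaded = pd.read_csv(buffer, index_col=0)

# The categorical dtype is not preserved by CSV, so restore it explicitly.
reloaded_cat = reloaded.astype("category")
```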