### Author: Ally Sprik
### Last-updated: 25-02-2024

The goal of this notebook is to use intelligent imputation on the CA125 values in the MAYO dataset, with the PIPENDO dataset as the source of the imputed values, to see whether the imputation has a strong effect.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.mode.copy_on_write = True  # Enable Copy-on-Write: avoids unnecessary copies and chained-assignment warnings. Technical detail: https://pandas.pydata.org/pandas-docs/stable/user_guide/copy_on_write.html

df_MAYO = pd.read_csv("../../0. Source_files/0.2. Cleaned_data/MAYO_cleaned_model.csv")
df_PIP = pd.read_csv('../../0. Source_files/0.2. Cleaned_data/Casper_PIPENDO_Cleaned.csv')

# Add every MAYO column that is missing from the PIPENDO dataset, filled with NaN
for col in df_MAYO.columns:
    if col not in df_PIP.columns:
        df_PIP[col] = np.nan
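
The column alignment above can also be done in one step with `DataFrame.reindex`. This is a minimal sketch on toy data: the frames `mayo` and `pip` and their values are hypothetical stand-ins for `df_MAYO` and `df_PIP`, not the real datasets.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_MAYO and df_PIP (hypothetical values)
mayo = pd.DataFrame({"ER": [1, 0], "CA125": [np.nan, 35.0], "Age": [60, 71]})
pip = pd.DataFrame({"ER": [1, 1], "CA125": [12.0, 40.0]})

# Columns that exist in MAYO but not in PIPENDO
missing_cols = [c for c in mayo.columns if c not in pip.columns]

# reindex appends the missing columns and fills them with NaN
pip_aligned = pip.reindex(columns=list(pip.columns) + missing_cols)

print(pip_aligned.columns.tolist())
```

The existing columns keep their order and values; only the MAYO-only columns are appended.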

Check the column factors of both datasets

Pseudocode:
- For every column in the MAYO dataset:
 - For every column in the PIPENDO dataset:
  - If the column names are the same:
   - Check if both columns have the same number of factors
    - If not, print the column name and the factors

In [None]:
# Flag shared columns whose number of distinct values differs between the two datasets
for col in df_MAYO.columns:
    if col in df_PIP.columns:
        MAYO_factors = df_MAYO[col].unique()
        PIP_factors = df_PIP[col].unique()
        if len(MAYO_factors) != len(PIP_factors):
            print(col)
            print(MAYO_factors)
            print(PIP_factors)
            print("")
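
Comparing only the number of factors can miss columns that have equally many levels but different labels (for example `neg` versus `negative`). A possible stricter check compares the sets of non-missing values. This is a sketch on toy data: the helper `differing_factors` and the example frames are hypothetical, and the check is only meaningful for categorical columns (a continuous column such as CA125 would nearly always be flagged).

```python
import pandas as pd

def differing_factors(left, right):
    """Return {column: (only_in_left, only_in_right)} for shared columns
    whose sets of non-missing values differ."""
    out = {}
    for col in left.columns.intersection(right.columns):
        left_values = set(left[col].dropna().unique())
        right_values = set(right[col].dropna().unique())
        if left_values != right_values:
            out[col] = (left_values - right_values, right_values - left_values)
    return out

# Toy stand-ins for df_MAYO and df_PIP (hypothetical values)
mayo = pd.DataFrame({"ER": ["pos", "neg"], "LVSI": ["yes", "no"]})
pip = pd.DataFrame({"ER": ["pos", "negative"], "LVSI": ["no", "yes"]})

diff = differing_factors(mayo, pip)
print(diff)
```

Here both `ER` columns have two levels, so a length comparison would pass, while the set comparison reports the mismatching labels.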

Select the columns that are used for the imputation, based on a matching scheme. For every row in MAYO with a missing CA125 value, each row of the PIPENDO dataset is checked to see whether it matches closely. If 7 or more of the imputation columns match, the CA125 value of that PIPENDO row is added to a list. Once all PIPENDO rows have been checked, the values in the list are counted and the most common one is used to fill in the missing value in the MAYO dataset.

Pseudocode:
- Select the columns that are used for the imputation
- For every row in the MAYO dataset:
 - If the CA125 value is missing:
  - Create an empty list
  - For every row in the PIPENDO dataset:
   - Create a counter
   - For every column in the imputation columns:
    - If the MAYO value is missing or the PIPENDO value is missing:
     - skip to the next column
    - If the MAYO value is the same as the PIPENDO value:
     - Add 1 to the counter
   - If the counter is 7 or higher:
    - Add the CA125 value to the list
  - Count the list
  - If the list is empty:
   - Skip to the next row
  - Else:
   - Fill in the most common CA125 value in the MAYO dataset

In [None]:
imputation_columns = ["ER", "PR", "p53", "L1CAM", "LVSI", "PreoperativeGrade", "PostoperativeGrade", "MyometrialInvasion"] 

# If at least 7 of these 8 columns match, the PIPENDO CA125 value becomes a candidate for the MAYO row
for i in range(len(df_MAYO)):
    if pd.isna(df_MAYO.loc[i, "CA125"]):
        ls = []
    
        for j in range(len(df_PIP)):
            x = 0
            for col in imputation_columns:
                MAYO_value = df_MAYO.loc[i, col]
                PIP_value = df_PIP.loc[j, col]
                if pd.isna(MAYO_value) or pd.isna(PIP_value):
                    continue
                elif MAYO_value == PIP_value:
                    x += 1
            if x >= 7:
                ls.append(df_PIP.loc[j, "CA125"])
        
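        # dropna=False also counts missing PIPENDO values; if NaN is the most common candidate, the MAYO value stays missing
        vlcounts = pd.Series(ls).value_counts(dropna=False)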
        
        if len(vlcounts) == 0:
            continue
        else:
            df_MAYO.loc[i, "CA125"] = vlcounts.index[0]
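
The nested loops above compare every MAYO row with every PIPENDO row one cell at a time, which can be slow on larger datasets. The inner two loops could be replaced by a column-wise comparison. This is a sketch under assumptions, not the notebook's method: the function `impute_by_matching`, its parameters, and the toy data are hypothetical, and a threshold of 2 out of 3 columns stands in for 7 out of 8.

```python
import numpy as np
import pandas as pd

def impute_by_matching(target, donor, columns, value_col, min_matches):
    """Fill missing target[value_col] with the most common donor value among
    donor rows that agree with the target row on at least min_matches columns."""
    result = target.copy()
    donor_features = donor[columns]
    missing_idx = result[result[value_col].isna()].index
    for idx in missing_idx:
        row = result.loc[idx, columns]
        # Element-wise equality; NaN never equals anything, so missing
        # values on either side are not counted as matches
        n_matches = donor_features.eq(row, axis="columns").sum(axis=1)
        candidates = donor.loc[n_matches >= min_matches, value_col]
        counts = candidates.value_counts(dropna=False)
        if len(counts) > 0:
            result.loc[idx, value_col] = counts.index[0]
    return result

# Toy donor (PIPENDO-like) and target (MAYO-like) data, hypothetical values
donor = pd.DataFrame({
    "a": [1, 1, 1, 0],
    "b": [1, 1, 0, 0],
    "c": [1, 0, 0, 0],
    "CA125": [10.0, 10.0, 20.0, 30.0],
})
target = pd.DataFrame({
    "a": [1, 0, 1, 0],
    "b": [1, 0, 0, 1],
    "c": [1, np.nan, 1, np.nan],
    "CA125": [np.nan, np.nan, 5.0, np.nan],
})

imputed = impute_by_matching(target, donor, ["a", "b", "c"], "CA125", min_matches=2)
print(imputed["CA125"].tolist())
```

In this toy example the first row is filled from two agreeing donors, the second from a single donor, the third already has a value, and the fourth has no donor with at least two matching columns, so it stays missing.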


In [None]:
df_MAYO.to_csv("../../0. Source_files/0.3. Imputed_data/Informed_imputation_CA125.csv", index=False)
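
To judge how much the imputation changed the data, it may help to count how many missing CA125 values were actually filled. This is a sketch on toy data: `before` and `after` are hypothetical series, and in the notebook this would require keeping a copy of the CA125 column from before the imputation loop.

```python
import numpy as np
import pandas as pd

# Hypothetical CA125 values before and after imputation
before = pd.Series([np.nan, 35.0, np.nan, np.nan], name="CA125")
after = pd.Series([10.0, 35.0, np.nan, 30.0], name="CA125")

n_missing_before = int(before.isna().sum())
n_missing_after = int(after.isna().sum())
n_filled = n_missing_before - n_missing_after

print(f"Imputed {n_filled} of {n_missing_before} missing values")
```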