### Author: Ally Sprik
### Last-updated: 25-02-2024

The goal of this notebook is to explore the original training dataset used by Reijnen et al. (2020) and to prepare it for further analysis.

In [None]:
import numpy as np
import pandas as pd

df = pd.read_spss("../0. Data/Trainingcohort(_wTCGA)/Original_model_dataset.sav")
# Treat empty strings, "99", and "NA" as missing values
df.replace({"": np.nan, " ": np.nan, "99": np.nan, "NA": np.nan}, inplace=True)
# Drop columns that are not useful for the current analysis
drop_substrings = ("Comorbidity", "ESMO", "filter", "_20", "_30", "_40", "_50")
cols_to_drop = [col for col in df.columns if any(s in col for s in drop_substrings)]
df.drop(columns=cols_to_drop, inplace=True)
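
As a minimal sketch of what this cleaning step does, the example below applies the same two operations to a small toy frame. The toy values and the column names `Age`, `Comorbidity_diabetes`, and `Grade_20` are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are invented for illustration.
toy = pd.DataFrame({
    "Age": ["61", "99", " "],
    "Comorbidity_diabetes": ["yes", "no", "NA"],
    "Grade_20": ["grade 1", "", "grade 2"],
})

# Map the placeholder strings to NaN, as in the cell above.
toy = toy.replace({"": np.nan, " ": np.nan, "99": np.nan, "NA": np.nan})

# Drop every column whose name contains one of the substrings.
substrings = ("Comorbidity", "_20")
toy = toy.drop(columns=[c for c in toy.columns if any(s in c for s in substrings)])

remaining_columns = list(toy.columns)
n_missing_age = int(toy["Age"].isna().sum())
print(remaining_columns, n_missing_age)
```

Note that the replacement matches the string `"99"` in every column, so a genuine value of 99 stored as text would also become missing.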

Create a binary LNM column based on the 'Positive_nodes_including_followup' column. Any value other than the two lymphadenectomy-confirmed labels becomes NaN.

In [None]:
df['LNM'] = df['Positive_nodes_including_followup'].apply(lambda x: 1 if x == 'yes, confirmed by lymphadenectomy' 
                                                          else(0 if x == 'no, confirmed by lymphadenectomy' else np.nan))
df['LNM'].value_counts(dropna=False)
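
The same encoding can be sketched with a dictionary and `Series.map`, which may be easier to read than nested conditionals. This is an illustration on invented toy data, not the notebook's own method. The label `"unknown"` is hypothetical and stands in for any other value. On a categorical column the resulting dtype may differ from what is shown here.

```python
import numpy as np
import pandas as pd

# Toy data; only the two "confirmed" labels come from the notebook.
nodes = pd.Series([
    "yes, confirmed by lymphadenectomy",
    "no, confirmed by lymphadenectomy",
    "unknown",  # invented label, stands for any other value
    np.nan,
], dtype=object)

lnm_map = {
    "yes, confirmed by lymphadenectomy": 1,
    "no, confirmed by lymphadenectomy": 0,
}

# Series.map turns every value that is not a key of the dict into NaN.
lnm = nodes.map(lnm_map)
lnm_counts = lnm.value_counts(dropna=False)
print(lnm_counts)
```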

The survival columns contained an error, which is assumed to have been caused by swapped 'yes' and 'no' labels. The cell below swaps the labels back.

In [None]:
# Swap the yes/no labels for three-year and five-year survival.
# The intermediate 0/1 encoding keeps the two replacements from colliding.
# The result is assigned back to the column, because calling
# replace(inplace=True) on a selected column may not update df in newer pandas versions.
for col in ['three_year_survival', 'five_year_survival']:
    df[col] = df[col].replace({'yes': 0, 'no': 1}).replace({1: 'yes', 0: 'no'})
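
A quick way to convince yourself that a label swap behaves as intended is to try it on a toy series first. The sketch below uses `Series.map` with a swap dictionary instead of the two-step replacement. That is a substitute technique with one difference: any value that is neither `'yes'` nor `'no'` becomes NaN.

```python
import pandas as pd

# Toy survival labels, invented for illustration.
survival = pd.Series(["yes", "no", "no", "yes"], dtype=object)

swap = {"yes": "no", "no": "yes"}

# Each value is looked up independently, so the two labels cannot collide.
swapped = survival.map(swap)

# Swapping twice should give back the original labels.
swapped_twice = swapped.map(swap)
print(swapped.tolist())
```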

The following code block is a reusable snippet for finding column names that contain a given substring. If the substring is set to "", it prints all column names.

In [None]:
for col in df.columns:
    if "PREOP" in col:
        print(col)
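
If the search is needed more than once, it could be wrapped in a small helper. The function name `find_columns` and the toy column `PREOP_grade` are hypothetical. The match is case-sensitive, so `"PREOP"` does not match `ER_expression_preop`.

```python
import pandas as pd

def find_columns(frame, substring=""):
    """Return the column names of `frame` that contain `substring`."""
    return [col for col in frame.columns if substring in str(col)]

# Empty toy frame; "PREOP_grade" is an invented column name.
toy = pd.DataFrame(columns=["ER_expression_preop", "PREOP_grade", "Grade"])

preop_columns = find_columns(toy, "PREOP")
all_columns = find_columns(toy)  # the empty string matches every column
print(preop_columns)
```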

See the grade distribution

In [None]:
df['Grade'].value_counts(dropna=False)

See the ER expression distribution

In [None]:
df["ER_expression_preop"].value_counts(dropna=False)

Replace the string labels with numerical labels. The mapping is applied to every column of the dataframe at once.

In [None]:
df.replace({
    "yes":1,
    "no":0,
    "positive":1,
    "negative":0,
    "wildtype":0,
    "overexpression":1,
    "Radiotherapy":1,
    "Chemotherapy":2,
    "Chemoradiation":3,
    "no invasion":0,
    "<50%":0,
    ">=50%":1,
    'no extra-uterine disease':0,
    "lymphadenopathy":1,
    "no LVSI":0,
    "LVSI":1,
    'not mentioned in the report':np.nan,
    'not elevated':0,
    'elevated':1,
    'grade 1':1,
    'grade 2':2,
    'grade 3 or non-endometrioid':3,
    'no trombocytosis':0,
    'trombocytosis':1,
    'no malignant cells':0,
    'malignant cells':1,
    'yes, local':1,
    'yes, regional':2,
    'yes, distant':3,
}, inplace=True)
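
Because the mapping is global, a label that is missing from the dictionary stays in place as a string. The sketch below shows one possible way to list such leftovers after the replacement. The toy frame, the column name `LVSI_toy`, and the label `"grade 4"` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame; "grade 4" is an invented label that the mapping does not cover.
toy = pd.DataFrame({
    "LVSI_toy": ["LVSI", "no LVSI", "not mentioned in the report"],
    "Grade": ["grade 1", "grade 2", "grade 4"],
}, dtype=object)

# A small subset of the notebook's mapping.
label_map = {
    "no LVSI": 0,
    "LVSI": 1,
    "not mentioned in the report": np.nan,
    "grade 1": 1,
    "grade 2": 2,
}
encoded = toy.replace(label_map)

# Collect, per column, the string values that survived the replacement.
leftover = {}
for col in encoded.columns:
    strings = sorted(v for v in encoded[col].dropna().unique() if isinstance(v, str))
    if strings:
        leftover[col] = strings
print(leftover)
```

An empty `leftover` dictionary would indicate that every string label was covered by the mapping.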

Save the dataframe to a CSV file, without the row index

In [None]:
df.to_csv("../0. Data/Trainingcohort(_wTCGA)/Original_model_dataset.csv",index=False)
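
To check how missing values survive the export, a round trip can be sketched in memory without touching the real file. The toy frame below is invented. `to_csv` writes NaN as an empty field by default, and `read_csv` reads it back as NaN.

```python
import io

import numpy as np
import pandas as pd

# Toy frame with one missing value, invented for illustration.
toy = pd.DataFrame({"LNM": [1.0, 0.0, np.nan], "Grade": [1, 2, 3]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the same text back and compare with the original.
buffer.seek(0)
restored = pd.read_csv(buffer)
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```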