# Data preparation

In [13]:
import numpy as np
import pandas as pd

Set a random seed number for attribute subsampling.

In [14]:
sampler_seed = 12

Load label data, patient info data, and miRNA data. In each case, set the `ID` column as the index of the data frame.

In [15]:
df_labels = pd.read_csv("data/labels_onehot.csv", delimiter=';').set_index("ID")
df_patient = pd.read_csv("data/patient-info_norm.csv", delimiter=';').set_index("ID")
df_mirna = pd.read_csv("data/mirna-expression_norm.csv", delimiter=';').set_index("ID")
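The loading pattern above can be tried without the real files. This is a minimal sketch, assuming a semicolon-separated layout like the one in `data/`; the two-row table and the `Other` column are hypothetical stand-ins, not the actual data. It also checks that the IDs are unique, which is worth confirming before any join.

```python
import io

import pandas as pd

# Hypothetical two-row file in a semicolon-separated layout (not the real data).
toy_csv = io.StringIO("ID;LumP;Other\nP1;1;0\nP2;0;1\n")

# index_col="ID" sets the index while parsing; it is equivalent to
# calling .set_index("ID") on the parsed frame.
df_toy = pd.read_csv(toy_csv, sep=";", index_col="ID")

# Duplicate IDs would multiply rows in later joins, so fail early if any exist.
assert df_toy.index.is_unique
```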

We are going to work with the `LumP` column as the target label for the ML analyses.

In [16]:
label_name = "LumP"

We are going to select a subset of the attributes from the patient info data frame, and we want to separate the categorical and continuous attributes within that subset. We do it by defining three variables.

In [17]:
categorical_patient_attribute_names = ["gender", "history_of_neoadjuvant_treatment", "primary_lymph_node_presentation_assessment", "lymphovascular_invasion_present", "neoplasm_histologic_grade", "history_non_muscle_invasive_blca"]
continuous_patient_attribute_names = ["days_to_birth", "weight", "number_pack_years_smoked", "age_at_initial_pathologic_diagnosis"]
patient_attribute_names = continuous_patient_attribute_names + categorical_patient_attribute_names

We create a new patient info data frame with the desired subsample of columns.

In [18]:
df_patient_subsampled = df_patient.loc[:, patient_attribute_names]

Take a look at the dimensions of the miRNA data frame.

In [19]:
df_mirna.shape

(409, 1881)

That's far too many columns for our modest computational resources. Instead, we are going to use a random subsample of 100 of those columns.

In [20]:
df_mirna_subsampled = df_mirna.sample(n=100, axis=1, random_state=sampler_seed)
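To illustrate why the seed matters, here is a small sketch on a made-up frame (the `mir_*` column names and sizes are hypothetical). With `axis=1`, `sample` draws columns rather than rows, and reusing the same `random_state` should select the same columns each time.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the miRNA frame: 5 patients, 20 columns.
rng = np.random.default_rng(0)
df_toy = pd.DataFrame(
    rng.normal(size=(5, 20)),
    columns=[f"mir_{i}" for i in range(20)],
)

# axis=1 samples columns; sampling is without replacement by default.
first = df_toy.sample(n=4, axis=1, random_state=12)
second = df_toy.sample(n=4, axis=1, random_state=12)

# The same seed gives the same column selection on both calls.
same_columns = list(first.columns) == list(second.columns)
```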

Now we create a combined data set with the labels, the subsampled patient info, and the subsampled miRNA data. This data set is built with inner joins in order to discard patients that are missing from any of the source data sets.

In [22]:
df = pd.DataFrame(df_labels.loc[:, label_name])
df = df.join(df_patient_subsampled, how="inner").join(df_mirna_subsampled, how="inner")
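The effect of chaining inner joins can be seen on a toy example. The three small frames below are hypothetical; only the patient present in all of them survives.

```python
import pandas as pd

# Hypothetical frames indexed by patient ID; each one misses a different patient.
labels = pd.DataFrame({"LumP": [1, 0, 1]}, index=pd.Index(["P1", "P2", "P3"], name="ID"))
patient = pd.DataFrame({"weight": [0.2, 0.5]}, index=pd.Index(["P1", "P2"], name="ID"))
mirna = pd.DataFrame({"mir_a": [0.1, 0.9]}, index=pd.Index(["P2", "P3"], name="ID"))

# Each inner join keeps only the IDs present on both sides,
# so the chain keeps the intersection of all three indexes.
joined = labels.join(patient, how="inner").join(mirna, how="inner")
kept_ids = list(joined.index)
```

Here only `P2` appears in all three frames, so `joined` has a single row.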

We end up with 403 patients and 111 data columns.

In [24]:
df.shape

(403, 111)
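The column count can be cross-checked from the pieces defined earlier: one label column, four continuous and six categorical patient attributes, and 100 sampled miRNA columns. A quick sketch of that arithmetic:

```python
# Counts taken from the attribute lists and the sample size used above.
n_label = 1
n_continuous = 4
n_categorical = 6
n_mirna = 100

expected_columns = n_label + n_continuous + n_categorical + n_mirna
```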

Finally, we divide the data again, separating labels, categorical and continuous columns in order to store the data ready for analysis.

In [25]:
df_labels = df.loc[:, label_name]
df_continuous = df.loc[:, continuous_patient_attribute_names + list(df_mirna_subsampled.columns)]
df_categorical = df.loc[:, categorical_patient_attribute_names].astype(int)

Save the prepared data frames.

In [26]:
# Cast with astype on the frame itself: in recent pandas versions, assigning
# through df.loc[:, cols] sets values in place and can leave the dtype unchanged.
df = df.astype({name: int for name in categorical_patient_attribute_names + [label_name]})
df.to_csv("data/dataset.tsv", sep='\t')

df_labels.to_csv("data/labels.tsv", sep='\t')
df_continuous.to_csv("data/continuous.tsv", sep='\t')
df_categorical.to_csv("data/categorical.tsv", sep='\t')
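A quick way to confirm that the saved files can be read back is a round trip through the same tab-separated format. This is a sketch on a hypothetical two-row frame, using an in-memory buffer instead of the real `data/` paths.

```python
import io

import pandas as pd

# Hypothetical frame with an ID index, one integer and one float column.
df_toy = pd.DataFrame(
    {"LumP": [1, 0], "weight": [0.25, 0.5]},
    index=pd.Index(["P1", "P2"], name="ID"),
)

# Write to an in-memory buffer with the same separator used for the .tsv files.
buffer = io.StringIO()
df_toy.to_csv(buffer, sep="\t")
buffer.seek(0)

# Read it back, restoring the ID column as the index.
df_back = pd.read_csv(buffer, sep="\t", index_col="ID")
round_trip_ok = df_back.equals(df_toy)
```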