# Preprocess training data for main cell types
This notebook is for preprocessing annotated data which has been exported from QuPath.
The data will be exported for XGBoost training or any supervised machine learning method of choice.  

In [148]:
%pwd
# change your working directory to the xgboost-cell-phenotype folder
%cd /Users/yokote.k/Desktop/MIBI/xgboost-cell-phenotype/

/Users/yokote.k/Desktop/MIBI/xgboost-cell-phenotype


In [149]:
import pandas as pd
import numpy as np

import re
import os
import json

## Input output files/folders

In [150]:
""" 
Give your batch a name. Eg. If it is one of many batches of NSCLC, it could be
NSCLCcohort_batch1_main_cell_types
"""

batch_name = "test_run"

In [151]:
"""
Where the preprocessed files should be stored. The folder will be created if it doesn't already exist
"""

output_folder = "resources/data/output"
os.makedirs(output_folder, exist_ok=True)

In [152]:
"""
The raw data exported from QuPath to be preprocessed
"""

expression_mat_path = "resources/data/raw/test_training_raw_data.csv"
expression_df = pd.read_csv(expression_mat_path)

## Encoding the cell types
When we train a machine learning model, the labels cannot be of type "string". For those new to coding, a string is a sequence of characters (letters, digits, or special characters such as .,!?) surrounded by quotation marks. Strings are used to store names, e.g. the cell type of a particular cell. However, most machine learning programs cannot work with string labels directly, so we instead create an encoding: a mapping from each cell type to an integer.

eg.   
Other       -> 0  
Epithelial  -> 1    
Stromal     -> 2, etc. 

In [153]:
"""
List all the cell types which you have defined in your manual labelling exactly how they appear.
"""
# Check that all the cell types are there
# remove the "Edited: " prefix which may have been added by the QuPath script
expression_df.loc[:, "Class"] = expression_df.loc[:, "Class"].str.replace("Edited: ", "")
expression_df.loc[:, "Name"] = expression_df.loc[:, "Name"].str.replace("Edited: ", "")

cell_types = expression_df.loc[:, "Class"].unique()
cell_types = sorted(cell_types)
print("Defined cell types:\n", cell_types)

Defined cell types:
 ['B cells', 'CD4 T cells', 'CD8 T cells', 'Dendritic cells', 'Epithelial cells', 'Granulocytes', 'Macrophages', 'Mast cells', 'NK cells', 'Other', 'Stromal cells', 'Treg cells', 'yd T cells']


In [154]:
# encoder for converting your labels
encoder = {cell_types[i]:i for i in range(len(cell_types))}

# decoder for decoding the results of the model. Save somewhere safe. 
decoder = {i:cell_types[i] for i in range(len(cell_types))}

with open(os.path.join(output_folder, "decoder.json"), "w") as json_file:
    json.dump(decoder, json_file, indent=4)

print("Encoding:", encoder)

Encoding: {'B cells': 0, 'CD4 T cells': 1, 'CD8 T cells': 2, 'Dendritic cells': 3, 'Epithelial cells': 4, 'Granulocytes': 5, 'Macrophages': 6, 'Mast cells': 7, 'NK cells': 8, 'Other': 9, 'Stromal cells': 10, 'Treg cells': 11, 'yd T cells': 12}
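
One detail worth knowing when the decoder is read back later: JSON object keys are always strings, so the integer keys written above come back as "0", "1", and so on. The sketch below uses a small toy decoder (not the full one saved above) and a hypothetical list of predictions to show one way of restoring the integer keys.

```python
import json

# Toy decoder with the same structure as the one saved above (values are illustrative)
decoder = {0: "B cells", 1: "CD4 T cells", 2: "CD8 T cells"}

# JSON object keys are always strings, so the integer keys become "0", "1", "2"
serialized = json.dumps(decoder, indent=4)
loaded = json.loads(serialized)

# Convert the keys back to integers before decoding model predictions
restored_decoder = {int(code): cell_type for code, cell_type in loaded.items()}

predictions = [2, 0, 1]  # hypothetical model output, for illustration only
decoded = [restored_decoder[p] for p in predictions]
print(decoded)
```

The same conversion applies when loading `decoder.json` from disk with `json.load`.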


## Save the labels

In [168]:
"""
Save the labels as a separate csv file. The labels will be encoded with the above encoding
"""
filename = os.path.join(output_folder, "{}_cell_type_labels.csv".format(batch_name))
labels = expression_df.loc[:, ["Name"]]
labels = labels.replace({"Name" : encoder})
labels.to_csv(filename, index=False)
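
A possible safeguard before saving: a name that is missing from the encoder is left unchanged by `replace`, so a stray string could end up in the labels file unnoticed. The sketch below is one way to fail loudly instead. The helper `encode_labels` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd


def encode_labels(names, encoder):
    """Map cell type names to integer codes, raising if any name is unknown."""
    codes = names.map(encoder)  # names missing from the encoder become NaN
    unmapped = sorted(names[codes.isna()].astype(str).unique())
    if unmapped:
        raise ValueError("Labels missing from the encoder: {}".format(unmapped))
    return codes.astype(int)


# Toy inputs for illustration only
toy_encoder = {"Epithelial cells": 0, "Other": 1}
toy_names = pd.Series(["Other", "Epithelial cells", "Other"], name="Name")
print(encode_labels(toy_names, toy_encoder).tolist())
```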

## Save the image, coordinate columns and any additional meta-data

In [156]:
"""
If the centroid measurements were exported in pixels rather than µm, this cell converts the pixel values to microns.

The pixel_size variable is the pixel size in microns per pixel. It is usually recorded in the image metadata (for example, in QuPath's Image tab).
"""
pixel_size = 0.3906

for dim in ["X", "Y"]:
    um_col = "Centroid {} µm".format(dim)
    px_col = "Centroid {} px".format(dim)
    if um_col in expression_df.columns:
        # The µm column exists: fill only its missing values from the pixel column
        null_arr = expression_df[um_col].isnull()
        if null_arr.any():
            expression_df.loc[null_arr, um_col] = expression_df.loc[null_arr, px_col] * pixel_size
            expression_df = expression_df.drop(columns=[px_col])
    else:
        # No µm column at all: derive it from the pixel column
        expression_df[um_col] = expression_df[px_col] * pixel_size
        expression_df = expression_df.drop(columns=[px_col])

In [157]:
"""
Additional meta-data you want to keep. Eg. If you have a column indicating whether the cell is in tumour or not... 
"""
additional_meta_data = []

image_coord_cols = ["Image", "Centroid X µm", "Centroid Y µm"] + additional_meta_data
image_coord_df = expression_df.loc[:, image_coord_cols]

"""
Save the image and coordinate columns. These are needed when we import the results back into QuPath
"""
image_coord_file_name = os.path.join(output_folder, "{}_images.csv".format(batch_name))
image_coord_df.to_csv(image_coord_file_name, index=False)

## The frequency of each cell type
When training a machine learning model, one must consider the 'balance' of the data. If one or more cell types dominate the data set, a model that performs well only on the majority cell types will still achieve a high overall accuracy while performing poorly on the less frequent cell types. For this reason, it may be worth adding more annotations of the rarer subtypes to compensate for this imbalance.

Of the roughly 50,000 cells we annotated for the NSCLC cohort, every subtype had at least 200 annotations.

In [158]:
expression_df.loc[:, "Class"].value_counts()

Epithelial cells    429
Other               377
NK cells             38
Stromal cells        33
Macrophages          28
CD4 T cells          23
CD8 T cells          19
B cells              18
Granulocytes         18
Dendritic cells       8
Treg cells            4
Mast cells            3
yd T cells            2
Name: Class, dtype: int64
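
To make the imbalance concrete, the sketch below flags under-represented classes and computes inverse-frequency class weights. The weighting is a common general heuristic and an assumption here, not something this notebook applies. The counts are a hand-copied subset of the output above, and `min_annotations` simply reuses the 200-annotation figure mentioned for the NSCLC cohort.

```python
import pandas as pd

# A few of the counts printed above, copied by hand for illustration
class_counts = pd.Series(
    {"Epithelial cells": 429, "Other": 377, "NK cells": 38, "yd T cells": 2}
)

# Flag classes that fall below a chosen minimum number of annotations
min_annotations = 200  # assumption: reuses the NSCLC figure quoted above
under_represented = class_counts[class_counts < min_annotations]
print(under_represented.index.tolist())

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count),
# so rarer classes receive proportionally larger weights
class_weights = class_counts.sum() / (len(class_counts) * class_counts)
print(class_weights.round(2))
```

Weights like these can partly offset imbalance during training, but they are no substitute for collecting more annotations of the rare cell types.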

## Preprocess the numerical measurements

In [159]:
preprocessed_expression_df = expression_df.copy()

"""
Remove unnecessary prefixes and underscores. 
"""
preprocessed_expression_df.columns = preprocessed_expression_df.columns.str.replace("Target:", "")
preprocessed_expression_df.columns = preprocessed_expression_df.columns.str.replace("_", " ")

In [160]:
"""
Collects all of the markers in this cohort
"""
# markers to include
markers = [col.replace(": Cell: Mean", "") for col in preprocessed_expression_df.columns if "Cell: Mean" in col]
print(markers)


['Beta-Tubulin', 'CD103', 'CD11c', 'CD14', 'CD16', 'CD163', 'CD20', 'CD206', 'CD3', 'CD39', 'CD4', 'CD45', 'CD45RA', 'CD45RO', 'CD49a', 'CD56', 'CD66b', 'CD68', 'CD69', 'CD8a', 'CTLA4', 'dTCR', 'GrzB', 'ICOS', 'IFN-y', 'LAG3', 'MHC I (HLA Class1)', 'MHC II (HLA-DR)', 'OX40', 'panCK', 'PD1', 'PDL1', 'Tim3', 'Tryptase', 'Vimentin']


In [161]:
"""
Define any markers you want to remove from the phenotyping. 

In this step, markers which do not help in determining the cell type should be removed. For example, dsDNA will not help in determining
cell types. 

Any markers where the staining did not work should also be removed.
"""
excluded_markers = ["dsDNA", "Beta-Tubulin", "CD39"]

markers = [marker for marker in markers if marker not in excluded_markers]

In [162]:
"""
Keep only the columns with the markers you want to keep.
"""
markers_ = [s + ": " for s in markers]
measurement_columns = [col for col in preprocessed_expression_df.columns if any(m in col for m in markers_)]
preprocessed_expression_df = preprocessed_expression_df.loc[:, measurement_columns]

In [163]:
"""
Due to the segmentation, some cells will not have a cytoplasm compartment. This happens when the nucleus boundary and the cell boundary
fall on the same pixels, which usually occurs in densely packed tumours where the two boundaries merge.

As a result, some cells will have missing values in the cytoplasm measurements. Instead of imputing these missing values with 0,
we use the corresponding membrane measurement, which is a more representative substitute.
"""

for col in preprocessed_expression_df.columns:
    null_arr = preprocessed_expression_df.loc[:, col].isnull()
    if null_arr.values.any():
        if "Cytoplasm" in col: 
            new_col = col.replace("Cytoplasm", "Membrane", 1)
            preprocessed_expression_df.loc[null_arr.values, col] = preprocessed_expression_df.loc[null_arr.values, new_col]
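
The effect of this imputation is easiest to see on a toy frame. The sketch below uses `fillna` as an equivalent, more compact form of the loop above. The column names are illustrative and assume the "marker: compartment: statistic" naming pattern seen in this notebook.

```python
import numpy as np
import pandas as pd

# Toy measurements: the second cell has no cytoplasm compartment
toy = pd.DataFrame({
    "CD3: Cytoplasm: Mean": [1.0, np.nan, 3.0],
    "CD3: Membrane: Mean": [1.5, 2.5, 3.5],
})

for col in toy.columns:
    if "Cytoplasm" in col:
        # Find the matching membrane column and use it to fill the gaps
        membrane_col = col.replace("Cytoplasm", "Membrane", 1)
        toy[col] = toy[col].fillna(toy[membrane_col])

print(toy["CD3: Cytoplasm: Mean"].tolist())
```

Only the missing cytoplasm value is replaced; cells that already have a cytoplasm measurement keep it.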

In [164]:
"""
Check which columns still have NA values. Remaining NA values usually point to measurement names that differ across images or cohorts.
If that is the cause, it can be fixed by renaming the columns for the images where the names differ.

Some things to check:
* Do all of my images have the same channel names on QuPath?
* Did I change the channel names before or after the segmentation? 
    * If after, the measurements would have been created using the previous channel names.
    * You can either change the names of the columns (best option) or if the channel names were
      completely different and you don't know which corresponds to which, you should re-run the segmentation
      using the new channel names. 
"""
print(preprocessed_expression_df.columns[preprocessed_expression_df.isna().any()].values)

[]
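
If the check above does list columns, the renaming option can be sketched as follows. Everything here is hypothetical: the toy frame, the channel names, and the `channel_renames` mapping would need to be replaced with the names actually found in your export.

```python
import numpy as np
import pandas as pd

# Toy frame where one image used an old channel name ("CD8") and another the new one ("CD8a")
toy = pd.DataFrame({
    "Image": ["img_1", "img_2"],
    "CD8a: Cell: Mean": [0.7, np.nan],
    "CD8: Cell: Mean": [np.nan, 0.4],
})

# Hypothetical mapping from old channel names to the current ones
channel_renames = {"CD8": "CD8a"}

for old, new in channel_renames.items():
    for col in [c for c in toy.columns if c.startswith(old + ": ")]:
        target = col.replace(old + ": ", new + ": ", 1)
        if target in toy.columns:
            # Both names exist: merge the old column into the new one, then drop it
            toy[target] = toy[target].fillna(toy[col])
            toy = toy.drop(columns=[col])
        else:
            # Only the old name exists: a plain rename is enough
            toy = toy.rename(columns={col: target})

print(toy.columns[toy.isna().any()].tolist())
```

Matching on the channel name followed by ": " keeps "CD8" from also matching "CD8a" columns.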


In [165]:
"""
If you believe that some compartments should not be considered during the phenotyping, remove them here
"""
compartments = []
compartments_cols_to_remove = [col for col in preprocessed_expression_df.columns if any(c in col for c in compartments)]
preprocessed_expression_df = preprocessed_expression_df.drop(columns=compartments_cols_to_remove)

"""
If you believe that some statistics should not be considered during the phenotyping, remove them here
"""
statistics = []
statistics_cols_to_remove = [col for col in preprocessed_expression_df.columns if any(s in col for s in statistics)]
preprocessed_expression_df = preprocessed_expression_df.drop(columns=statistics_cols_to_remove)

## Save the preprocessed measurements

In [166]:
"""
Save the input preprocessed data
"""
preprocessed_expression_df_path = os.path.join(output_folder,  "{}_preprocessed_input_data.csv".format(batch_name))
preprocessed_expression_df.to_csv(preprocessed_expression_df_path, index=False)