# HistoMIL Preprocessing Notebook

This Jupyter notebook walks through preprocessing histopathology whole-slide images (WSIs) with HistoMIL: tissue segmentation, patching (tiling), and feature extraction. The steps are run in batch over a cohort of slides. Predefined preprocessing parameters ship with the HistoMIL package and can be overridden in this notebook.

Additionally, this notebook will demonstrate how to perform preprocessing steps on a single slide file.

## Getting Started

Before proceeding with this notebook, please make sure that you have followed the setup instructions provided in the project's README file. This includes creating a conda environment and installing the required dependencies.

## Batch Preprocessing

The batch preprocessing pipeline in HistoMIL consists of the following steps:

1. Tissue segmentation
2. Patching (tiling)
3. Feature extraction

The default preprocessing parameters can be found in the HistoMIL/EXP/paras/slides.py file. You can modify these parameters to customize the preprocessing pipeline for your specific needs.

To perform batch preprocessing, call the cohort_slide_preprocessing method of the Experiment class (HistoMIL.EXP.workspace.experiment.Experiment). Here's an example of how to run batch preprocessing:

In [1]:
%load_ext autoreload
%autoreload 2

# Set HistoMIL in PATH or change directory to where HistoMIL is

In [3]:
import os
os.getcwd()
os.chdir('/Users/awxlong/Desktop/my-studies/hpc_exps/') # 'path/to/ parent dir of HistoMIL'

In [4]:
# avoid pandas warning
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
# avoid multiprocessing problem
import torch
torch.multiprocessing.set_sharing_strategy('file_system')

#------>stop skimage warning
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
import imageio.core.util
import skimage 
def ignore_warnings(*args, **kwargs):
    pass
imageio.core.util._precision_warn = ignore_warnings

#set logger as INFO
from HistoMIL import logger
import logging
logger.setLevel(logging.INFO)

import pickle
import timm
import csv



In [5]:
from HistoMIL.EXP.paras.env import EnvParas
from HistoMIL.EXP.workspace.experiment import Experiment
from HistoMIL import logger
import logging
logger.setLevel(logging.INFO)

In [6]:
#--------------------------> parameters for reading data

preprocess_env = EnvParas()
preprocess_env.exp_name = "wandb exp name"      # e.g. "debug_preprocess"
preprocess_env.project = "wandb project name"   # e.g. "test-project" 
preprocess_env.entity =  "wandb entity name"    # make sure it's initialized to an existing wandb entity

#----------------> cohort
# you can find more options in HistoMIL/EXP/paras/cohort.py
preprocess_env.cohort_para.localcohort_name = "COAD" # cohort name, e.g. 'BRCA'
preprocess_env.cohort_para.task_name = "g0_arrest"     # biomarker name, e.g., 'g0_arrest' and HAS TO COINCIDE with column name
preprocess_env.cohort_para.cohort_file = f'local_cohort_{preprocess_env.cohort_para.localcohort_name}.csv' # e.g. local_cohort_BRCA.csv, this is created automatically, and contains folder, filename, slide_nb, tissue_nb, etc. 
preprocess_env.cohort_para.task_file = f'{preprocess_env.cohort_para.localcohort_name}_{preprocess_env.cohort_para.task_name}.csv' # e.g. BRCA_g0_arrest.csv, which has PatientID matched with g0_arrest labels. This is SUPPLIED by the user and assumed to be stored in the EXP/Data/ directory
preprocess_env.cohort_para.pid_name = "PatientID"           # default column for merging tables
preprocess_env.cohort_para.targets = ['g0_arrest']  # ['name of target_label column'] e.g.  ["g0_arrest"]  # the column name of interest # supply as a list of biomarkers
preprocess_env.cohort_para.targets_idx = 0                  
preprocess_env.cohort_para.label_dict ="{'negative':0,'positive':1}"  # SINGLE quotations for the keys, converts strings objects to binary values
# preprocess_env.cohort_para.task_additional_idx = ["g0_score"] # if CRC_g0_arrest.csv has other biomarkers of interest, name them in this variable, default None. 
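
The label_dict above is supplied as a string rather than as a dict. The sketch below shows one way such a string can be turned into a mapping and applied to string labels. It assumes the string is read as a Python literal; how HistoMIL parses it internally is not shown in this notebook, and `parse_label_dict` is a hypothetical helper. A single-quoted dict string is a valid Python literal but not valid JSON, which is one plausible reason for the quoting requirement.

```python
import ast


def parse_label_dict(label_dict_str):
    """Hypothetical helper: parse a dict literal that was written as a string."""
    mapping = ast.literal_eval(label_dict_str)  # safe evaluation of literals only
    if not isinstance(mapping, dict):
        raise ValueError("label_dict must describe a dict")
    return mapping


label_map = parse_label_dict("{'negative':0,'positive':1}")

# Convert string labels to their binary values
labels = ["positive", "negative", "positive"]
encoded = [label_map[label] for label in labels]
print(encoded)
```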


## Write a Sample Task Cohort File

In [23]:
# Input and output file names
input_file = '/Users/awxlong/Desktop/my-studies/hpc_exps/HistoMIL/gdc_manifest.2024-06-18.txt' # path to the manifest.txt file which has filenames of diagnostic slides downloaded from TCGA
output_file = f'/Users/awxlong/Desktop/my-studies/hpc_exps/Data/{preprocess_env.cohort_para.task_file}' # path to Data/ dir inside the experiment directory where these cohort .csv files are stored 
# Read the input file and process the filenames
with open(input_file, 'r') as infile, open(output_file, 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='\t')  # Assuming tab-separated values
    writer = csv.writer(outfile)
    
    # Write header to the output file
    writer.writerow([preprocess_env.cohort_para.pid_name, preprocess_env.cohort_para.task_name])
    
    # Skip the header row of the input file
    next(reader)
    
    # Process each row
    for row in reader:
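        if len(row) >= 4:  # Need at least 4 columns: row[1] (filename) and row[3] are both read below
            filename = row[1]  # The filename is in the second column (index 1)
            index = filename[:12]  # Take the first 12 characters as the patient ID (e.g. the TCGA-XX-XXXX prefix)
            even_odd = 'positive' if int(row[3])%2 == 0 else 'negative'  # Placeholder label from the parity of the fourth column (file size in a GDC manifest), NOT a real biomarker label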
            writer.writerow([index, even_odd])
            
print(f"Processing complete. Output saved to {output_file}")


Processing complete. Output saved to /Users/awxlong/Desktop/my-studies/hpc_exps/Data/COAD_g0_arrest.csv
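
To check the conversion logic without touching the file system, the same steps can be run on an in-memory manifest. This is a minimal sketch: `manifest_to_task_rows` is a hypothetical helper, the sample rows are made up, and the column layout (id, filename, md5, size, state) is assumed to match the downloaded manifest. As in the cell above, the labels are placeholders derived from the parity of the size column.

```python
import csv
import io


def manifest_to_task_rows(manifest_text, pid_name="PatientID", task_name="g0_arrest"):
    """Hypothetical helper: build task-file rows (header first) from manifest text."""
    reader = csv.reader(io.StringIO(manifest_text), delimiter="\t")
    next(reader)  # skip the manifest header row
    rows = [[pid_name, task_name]]
    for row in reader:
        if len(row) >= 4:  # filename is row[1], size is row[3]
            patient_id = row[1][:12]  # first 12 characters of the filename
            label = "positive" if int(row[3]) % 2 == 0 else "negative"  # placeholder label
            rows.append([patient_id, label])
    return rows


# Made-up manifest content for illustration only
sample = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "uuid-1\tTCGA-AA-0001-01Z-00-DX1.svs\tabc\t1000\treleased\n"
    "uuid-2\tTCGA-AA-0002-01Z-00-DX1.svs\tdef\t1001\treleased\n"
)
rows = manifest_to_task_rows(sample)
for row in rows:
    print(row)
```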


In [None]:
# #----------------> model specifications for preprocessing
# slide-level parameters
print(preprocess_env.collector_para.slide)

# tissue-level parameters
print(preprocess_env.collector_para.tissue)

# patch-level parameters
step_size = 224  # placeholder value: replace with your step size for patching
preprocess_env.collector_para.patch.step_size = step_size  # ASSUME this also decides the size of patch, although you can change this
preprocess_env.collector_para.patch.patch_size = (step_size, step_size)  # can change this, default is 512, 512
print(preprocess_env.collector_para.patch)

# feature-extraction parameters
# by default uses resnet18
# Each entry maps a backbone name to a constructor, so weights are only downloaded when that backbone is selected
BACKBONES = {
    'UNI': None,  # Not implemented
    'prov-gigapath' : lambda: timm.create_model("hf_hub:prov-gigapath/prov-gigapath", pretrained=True)
}
backbone_name = None # name of feature extractor, e.g. 'prov-gigapath'. If None, HistoMIL uses resnet18 by default
if backbone_name:
    if BACKBONES.get(backbone_name) is None:
        raise NotImplementedError(f"Backbone '{backbone_name}' is not available")
    preprocess_env.collector_para.feature.model_name = backbone_name                  # e.g. 'prov-gigapath'
    preprocess_env.collector_para.feature.model_instance = BACKBONES[backbone_name]() # build the model only now
print(preprocess_env.collector_para.feature)
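
The patch-level parameters set in this cell interact: step_size controls how far the window moves between patches, while patch_size controls how large each patch is. The sketch below illustrates this with a generic sliding-window grid. It is not HistoMIL's patching code, and `patch_origins` is a hypothetical helper.

```python
def patch_origins(width, height, patch_size, step_size):
    """Top-left (x, y) corners of all patches that fit fully inside the region."""
    patch_w, patch_h = patch_size
    xs = range(0, width - patch_w + 1, step_size)
    ys = range(0, height - patch_h + 1, step_size)
    return [(x, y) for y in ys for x in xs]


# step_size == patch width: patches tile the region without overlap
non_overlapping = patch_origins(1024, 1024, (512, 512), 512)

# step_size < patch width: neighbouring patches overlap by half
overlapping = patch_origins(1024, 1024, (512, 512), 256)

print(len(non_overlapping), len(overlapping))  # prints "4 9"
```

A smaller step size yields more patches (and more features to store) for the same slide area.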

    

In [None]:

#----------------> dataset parameters
preprocess_env.dataset_para.dataset_name = 'dataset name' # placeholder, e.g. "DNAD_L2"
preprocess_env.dataset_para.concepts = ['slide', 'tissue', 'patch', 'feature']    # default concepts, in this ORDER; change to the concepts you want to use
preprocess_env.dataset_para.split_ratio = [0.99, 0.01]   # example split ratio; the values must sum to one
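
To make the split_ratio constraint concrete, here is a minimal sketch of dividing a list of items according to ratios that sum to one. How HistoMIL applies split_ratio internally is not shown in this notebook; `split_by_ratio` is a hypothetical helper and the patient names are made up.

```python
import random


def split_by_ratio(items, split_ratio, seed=0):
    """Hypothetical helper: shuffle items and cut them into consecutive parts."""
    if abs(sum(split_ratio) - 1.0) > 1e-8:
        raise ValueError("split_ratio must sum to one")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    splits, start = [], 0
    for i, ratio in enumerate(split_ratio):
        # The last part takes whatever remains, so no item is dropped by rounding
        end = n if i == len(split_ratio) - 1 else start + round(n * ratio)
        splits.append(shuffled[start:end])
        start = end
    return splits


patients = [f"patient_{i}" for i in range(100)]
train, test = split_by_ratio(patients, [0.99, 0.01])
print(len(train), len(test))  # prints "99 1"
```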

In [None]:
#--------------------------> init machine and person by reading pkl file from notebook 0

machine_cohort_loc = "Path/to/BRCA_machine_config.pkl"
with open(machine_cohort_loc, "rb") as f:   # Unpickling
    [data_locs,exp_locs,machine,user] = pickle.load(f)
preprocess_env.data_locs = data_locs
preprocess_env.exp_locs = exp_locs
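
The cell above expects a pickle file containing a four-element list. The round trip below sketches that layout with plain dicts standing in for the real HistoMIL objects (the stand-in contents are made up). Unpickling the real file requires HistoMIL to be importable, because pickle stores references to the classes of the saved objects, and pickle files should only be loaded from trusted sources.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-ins for the four objects saved in notebook 0 (contents are illustrative)
config = [
    {"slides": "/path/to/slides"},  # stands in for data_locs
    {"exp_root": "/path/to/exp"},   # stands in for exp_locs
    {"device": "cpu"},              # stands in for machine
    {"name": "user"},               # stands in for user
]

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = Path(tmp_dir) / "machine_config.pkl"
    with open(pkl_path, "wb") as f:  # pickling
        pickle.dump(config, f)
    with open(pkl_path, "rb") as f:  # unpickling, in the same order as saved
        data_locs, exp_locs, machine, user = pickle.load(f)

print(data_locs, exp_locs, machine, user)
```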

In [None]:

#--------------------------> setup experiment for preprocessing (no ssl)
logger.info("setup experiment")
from HistoMIL.EXP.workspace.experiment import Experiment
exp = Experiment(env_paras=preprocess_env)
exp.setup_machine(machine=machine,user=user)
logger.info("setup data")
exp.init_cohort()                   # This will create 2 files inside EXP/Data/: local_cohort_BRCA.csv which has filenames of WSIs stored in TCGA-BRCA/ and Task_g0_arrest.csv which merges the local_cohort_BRCA.csv with the supplied BRCA_g0_arrest.csv
logger.info("pre-processing..")
exp.cohort_slide_preprocessing(concepts=["slide","tissue","patch","feature"],
                                is_fast=True, force_calc=False)
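
The comment on exp.init_cohort() describes merging the automatically created local cohort file with the user-supplied task file on the PatientID column. The pandas sketch below illustrates that kind of merge on made-up data. The inner join and the exact columns are assumptions for illustration, not HistoMIL's implementation.

```python
import pandas as pd

# Made-up stand-in for the local cohort file (one row per slide)
local_cohort = pd.DataFrame({
    "PatientID": ["TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003"],
    "filename": ["a.svs", "b.svs", "c.svs"],
})

# Made-up stand-in for the user-supplied task file (one row per labelled patient)
task = pd.DataFrame({
    "PatientID": ["TCGA-AA-0001", "TCGA-AA-0002"],
    "g0_arrest": ["positive", "negative"],
})

# Keep only slides whose patient has a label (inner join is an assumption)
merged = local_cohort.merge(task, on="PatientID", how="inner")

# Patients with slides but no label are worth checking before training
unlabeled = sorted(set(local_cohort["PatientID"]) - set(task["PatientID"]))
print(merged)
print(unlabeled)
```

If the merged table is empty, a likely cause is that the PatientID values in the task file do not match the prefix derived from the slide filenames.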

## Single Slide Preprocessing

If you want to perform preprocessing steps on a single slide file, you can use the pre_process_wsi_collector function (HistoMIL.DATA.Slide.collector.pre_process_wsi_collector). The cell below shows how this function is defined, followed by an example of how to call it:

In [None]:
from pathlib import Path
from HistoMIL.DATA.Slide.collector import WSICollector,CollectorParas
from HistoMIL.EXP.paras.slides import DEFAULT_CONCEPT_PARAS
def pre_process_wsi_collector(data_locs,
                            wsi_loc:Path,
                            collector_paras:CollectorParas,
                            concepts:list=["slide","tissue","patch"],
                            fast_process:bool=True,force_calc:bool=False):

    C = WSICollector(db_loc=data_locs,wsi_loc=wsi_loc,paras=collector_paras)
    try:

        for name in concepts:
            if name == "tissue":
                if fast_process:
                    from HistoMIL.EXP.paras.slides import set_min_seg_level
                    C.paras.tissue = set_min_seg_level(C.paras.tissue, C.slide,C.paras.tissue.min_seg_level)
                    logger.debug(f"Collector:: set seg level to {C.paras.tissue.seg_level}")
            C.create(name)
            C.get(name, force_calc) # for tissue, req_idx_0 is always default slide
    except Exception as e:
        logger.exception(e)
    else:
        logger.info(f"Collector:: {wsi_loc} is done")
    finally:
        del C

folder = "folder of wsi/"
fname =  "name of wsi.svs"
wsi_loc = Path(str("/"+ folder +"/"+ fname))

pre_process_wsi_collector(data_locs,
                            wsi_loc,
                            collector_paras=DEFAULT_CONCEPT_PARAS,
                            )
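
To apply the single-slide call to a handful of files, the slide paths first need to be collected. The sketch below lists slide files under a folder using pathlib. `find_slides` is a hypothetical helper, and the .svs suffix is taken from the example filename above.

```python
import tempfile
from pathlib import Path


def find_slides(root, suffixes=(".svs",)):
    """Hypothetical helper: sorted paths of slide files found under root."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


# Demonstration on a temporary folder with empty placeholder files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("slide_b.svs", "slide_a.svs", "notes.txt"):
        (Path(tmp_dir) / name).touch()
    slide_names = [p.name for p in find_slides(tmp_dir)]

print(slide_names)  # prints "['slide_a.svs', 'slide_b.svs']"
```

Each returned path could then be passed as wsi_loc to pre_process_wsi_collector in a loop.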