### Script preparation

The pipeline was tested successfully in Python v3.11.11. The following packages are assumed to be installed by default:
1. "numpy" v2.0.2 for handling arrays and matrices,
2. "pandas" v2.2.3 for handling tabular data,
3. "matplotlib" v3.10.0 for plotting, and
4. "pickle" (part of the Python standard library) for handling machine learning models as objects.

Install the packages required for the pipeline:
1. "scikit-learn" v1.6.0 package for model training, hyperparameter optimization, and validation,
2. "seaborn" v0.13.2 for plotting, and
3. "flowio" v1.3.0, "FlowCal" v1.3.0, and "flowutils" v1.1.0 packages for handling flow cytometry standard (FCS) files.

The following phrase marks a cell that requires input from you:
<font color='orange'>"**INPUT:**"</font>

All other cells should be run without any change to the code.

<font color='orange'>To facilitate experimenting with the pipeline, one sample training dataset (Sample_training_dataset.csv) and one sample patient file (Sample_patient.fcs) have been provided, so the whole pipeline can be run as is, with one small exception at the test step (see the "READ ME" comment inside the `analyze` function).</font>

In [None]:
!pip install scikit-learn
!pip install seaborn
!pip install flowio
!pip install FlowCal
!pip install flowutils
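
Because the install commands above do not pin versions, the installed packages may differ from the versions listed as tested. The following is a minimal sketch, not part of the original pipeline, that compares installed versions against the ones named in the text using the standard library `importlib.metadata`. The helper name `compare_versions` and the dictionary `TESTED_VERSIONS` are introduced here for illustration only.

```python
from importlib import metadata

# Versions the text reports as tested; keys are the pip distribution names.
TESTED_VERSIONS = {
    "numpy": "2.0.2",
    "pandas": "2.2.3",
    "matplotlib": "3.10.0",
    "scikit-learn": "1.6.0",
    "seaborn": "0.13.2",
    "flowio": "1.3.0",
    "FlowCal": "1.3.0",
    "flowutils": "1.1.0",
}


def compare_versions(tested):
    """Return {package: (installed_version_or_None, tested_version)}."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, wanted)
    return report


for name, (installed, wanted) in compare_versions(TESTED_VERSIONS).items():
    status = "ok" if installed == wanted else "differs"
    print(f"{name}: installed={installed}, tested={wanted} ({status})")
```

A differing version does not necessarily break the pipeline; the report only shows where the environment departs from the tested one.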

Import the required packages for the entire pipeline.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold
import seaborn as sns
import flowio
import flowutils
import FlowCal
import warnings
warnings.filterwarnings('ignore')

### Training dataset

The tabular dataset for the training step must meet the prerequisites described below:
1. Each row except the first (header) row corresponds to one event acquired and exported in Infinicyt.
2. The column names must include the forward and side scatter parameters, the fluorescent parameters, one "Population" column with manual annotations, and one "Batch" column containing the batch number or patient ID.
3. Make sure that there are no FSC-W, SSC-H, SSC-W, or TIME parameters (customizable, but requires further changes in the script).
4. Make sure that your MRD and erythroid series are named "Residual Leukemic Cells", "Erythroid Precursors", and "Erythroid Cells", respectively (customizable, but requires further changes in the script).
5. Most importantly, make sure the names of the fluorescent parameters in your dataset match the ones in your test FCS files.

A representation of the required dataset for 3 example events is provided below:

![Training dataset example.png](attachment:f45673f1-4488-460e-a10c-23d950e80b7f.png)

<font color='orange'>**INPUT:** Load the training dataset into a pandas DataFrame. Insert the training dataset file name as shown below:</font>

*dataset_name = 'example.csv'*

In [None]:
dataset_name = 'Sample_training_dataset.csv'    
data = pd.read_csv(dataset_name) 
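
The prerequisites listed above can be checked before training. The sketch below is one possible check, not part of the original pipeline: `check_training_table` and the three constant sets are hypothetical names. It works on plain lists of column names and labels, so it runs without the dataset. Two assumptions are made: the forbidden parameter names are compared case-insensitively, and "FSC-H" is treated as required because the training cell drops it by name.

```python
# "FSC-H" is included because the training cell drops it by name.
REQUIRED_COLUMNS = {"Population", "Batch", "FSC-H"}
# Parameters the text says must be absent (compared case-insensitively here,
# which is an assumption of this sketch).
FORBIDDEN_COLUMNS = {"FSC-W", "SSC-H", "SSC-W", "TIME"}
EXPECTED_LABELS = {"Residual Leukemic Cells", "Erythroid Precursors",
                   "Erythroid Cells"}


def check_training_table(columns, populations):
    """Return a list of problems found; an empty list means none were found."""
    columns = [str(c) for c in columns]
    problems = []
    for name in sorted(REQUIRED_COLUMNS - set(columns)):
        problems.append(f"missing required column: {name}")
    forbidden_upper = {f.upper() for f in FORBIDDEN_COLUMNS}
    for name in columns:
        if name.upper() in forbidden_upper:
            problems.append(f"unexpected parameter: {name}")
    for label in sorted(EXPECTED_LABELS - set(populations)):
        problems.append(f"label not found in Population: {label}")
    return problems


# With the notebook's DataFrame this could be called as:
# check_training_table(data.columns, data["Population"].unique())
example = check_training_table(
    ["FSC-A", "SSC-A", "Time", "Population"],
    ["Residual Leukemic Cells", "Erythroid Cells"],
)
for problem in example:
    print(problem)
```

A missing label is reported rather than treated as fatal, since a dataset may legitimately lack one of the populations.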

### Training and validation steps

Define the training data, labels, and group labels. "Population" is the column with annotations and "Batch" is the column with the patient or batch ID. Both are removed from the training features, together with "FSC-H".

In [None]:
X = data.drop(['Population', 'FSC-H', 'Batch'], axis=1)
y = data['Population']
groups = data['Batch']

Define the parameter grid for hyperparameter optimization (customizable, directly impacts the computation time).

In [None]:
param_grid = {'n_estimators': [50, 100],
              'min_samples_split': [5, 20, 50, 150],
              'min_samples_leaf': [1, 20, 100]}
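
To see why the grid size drives computation time, the number of model fits can be worked out from the grid itself. This short sketch repeats the grid above and assumes the 5 folds used in the cross-validation cell that follows.

```python
from math import prod

param_grid = {'n_estimators': [50, 100],
              'min_samples_split': [5, 20, 50, 150],
              'min_samples_leaf': [1, 20, 100]}
n_splits = 5  # folds used in the cross-validation cell

# Every combination of values is one candidate: 2 * 4 * 3.
n_candidates = prod(len(values) for values in param_grid.values())
# Each candidate is fitted once per fold.
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)  # 24 120
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full dataset, so adding a single value to any list increases the total noticeably.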

Perform the grid search and fit the model to the data with 5-fold cross-validation grouped by batch (customizable, directly impacts the computation time). THIS STEP CAN TAKE UP TO SEVERAL HOURS TO COMPLETE.

In [None]:
CV = StratifiedGroupKFold(n_splits=5)
model = GridSearchCV(RandomForestClassifier(random_state=42, n_jobs=-1, criterion='entropy'),
                     param_grid, cv=CV, verbose=0, scoring='accuracy')
model.fit(X, y, groups=groups)

View the best parameters and save the GridSearchCV result.

In [None]:
print(model.best_params_)
pd.DataFrame(model.cv_results_).to_csv('gridsearchcv_results.csv')
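
The saved table contains one row per parameter combination. The sketch below shows one way to rank the candidates using columns that `cv_results_` provides (`rank_test_score`, `mean_test_score`, `std_test_score`, and the `param_*` columns). The scores in it are invented for illustration and are not results from the pipeline. In practice, the table could be read back with `pd.read_csv('gridsearchcv_results.csv', index_col=0)`.

```python
import pandas as pd

# Hypothetical rows shaped like GridSearchCV.cv_results_; the numbers are
# invented for illustration only.
cv_results = pd.DataFrame({
    "param_n_estimators": [50, 100, 100],
    "param_min_samples_leaf": [1, 1, 20],
    "mean_test_score": [0.91, 0.93, 0.90],
    "std_test_score": [0.02, 0.01, 0.03],
    "rank_test_score": [2, 1, 3],
})

columns = ["rank_test_score", "mean_test_score", "std_test_score",
           "param_n_estimators", "param_min_samples_leaf"]
# Rank 1 is the combination reported by best_params_.
ranked = cv_results.sort_values("rank_test_score")[columns]
print(ranked.to_string(index=False))
```

Looking at `std_test_score` next to the mean helps judge whether the top candidates differ by more than the fold-to-fold variation.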

<font color='orange'>**INPUT:** Save the model into an object for future use. Insert the model file name as shown below:</font>

*model_name = 'example.pkl'*

In [None]:
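model_name = 'Sample_model.pkl'
# Save the best estimator; the with-block closes the file when done.
with open(model_name, 'wb') as model_file:
    pickle.dump(model.best_estimator_, model_file)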

### Test step

A function that analyzes a test FCS file, saves a table with the percentage of each predicted population, and saves the requested figures.

<font color='orange'>Please read the "READ ME" comment inside the function before running it on the sample files.</font>

In [None]:
def analyze(address, ml_model, dotplot_num, dotplot_params):
    file_address = address
    global file_name
    file_name = file_address.replace('.fcs', '')
    fcs_file = flowio.FlowData(file_address)
    try:
        spill, markers = flowutils.compensate.get_spill(fcs_file.text['spill'])
    except KeyError:
        spill, markers = flowutils.compensate.get_spill(fcs_file.text['spillover'])
    raw_data = np.reshape(fcs_file.events, (-1, fcs_file.channel_count))
    fluoro_indices = []
    for channel in fcs_file.channels:
        if fcs_file.channels[channel]['PnN'] in markers:
            fluoro_indices.append(int(channel) - 1)
    fluoro_indices.sort()
    comp_data = flowutils.compensate.compensate(raw_data, spill, fluoro_indices)
    channel_list = []
    for i in range(1, fcs_file.channel_count + 1):
        channel_list.append(fcs_file.text['p{}n'.format(i)])
    flow_data = FlowCal.transform.to_rfi(comp_data, amplification_type=(tuple([(0, 0)] * len(channel_list))))
    events = pd.DataFrame(flow_data, columns=channel_list)
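    # Keep events whose FSC-A / FSC-H ratio is below 2.
    # .copy() gives an independent DataFrame, so the later column edits
    # do not act on a slice of `events`.
    total_events = events.loc[events['FSC-A'] / events['FSC-H'] < 2].copy()
    # Drop parameters that are not model features; absent columns are skipped.
    total_events = total_events.drop(columns=['FSC-H', 'SSC-H', 'Time'], errors='ignore')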
    model = ml_model

    # READ ME:
    # If you want to test the sample files (csv and fcs), the columns of the sample fcs must match the training dataset.
    # To do so, also run the next line of code (remove # from the beginning).
    #total_events = total_events[['FSC-A', 'SSC-A', 'FITC-A', 'PE-A', 'PerCP-Cy5.5-A', 'PE-Cy7-A', 'APC-A', 'APC-R700-A', 'APC-Cy7-A', 'V450-A', 'V500-C-A']]
    total_events['Predicted'] = model.predict(total_events)
    global wbc_events
    wbc_events = total_events.loc[(total_events['Predicted'] != 'Erythroid Cells') & (total_events['Predicted'] != 'Erythroid Precursors')]
    results = wbc_events['Predicted'].value_counts(normalize=True) * 100
    results.to_csv('{} results.csv'.format(file_name))
    global mrd_events
    mrd_events = wbc_events.loc[wbc_events['Predicted'] == 'Residual Leukemic Cells']
    for n in range(dotplot_num):
        dot_plot(dotplot_params[n][0], dotplot_params[n][1])
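
Note that the percentages written by `analyze` are computed after the events predicted as erythroid are removed, so each value is relative to the remaining events rather than to all events in the file. The arithmetic can be sketched with the standard library as follows. The ten labels are hypothetical, and "Lymphocytes" is only an illustrative population name.

```python
from collections import Counter

# Hypothetical predicted labels for ten events.
predicted = (["Erythroid Cells"] * 2 + ["Lymphocytes"] * 6
             + ["Residual Leukemic Cells"] * 2)

ERYTHROID = {"Erythroid Cells", "Erythroid Precursors"}
# Same exclusion as in `analyze`: erythroid events leave the denominator.
remaining = [label for label in predicted if label not in ERYTHROID]

counts = Counter(remaining)
percentages = {label: 100 * n / len(remaining) for label, n in counts.items()}
print(percentages)
# {'Lymphocytes': 75.0, 'Residual Leukemic Cells': 25.0}
```

Here 2 of the 10 events are leukemic, but the reported value is 25% because only 8 events remain in the denominator.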

A function for creating and saving figures showing the MRD population based on your parameters of choice.

In [None]:
def dot_plot(x, y):
    sns.scatterplot(x=wbc_events[x], y=wbc_events[y], c='lightgrey', s=1)
    sns.scatterplot(x=mrd_events[x], y=mrd_events[y], c='maroon', s=3)
    if x in ['FSC-A', 'SSC-A']:
        plt.xscale('linear')
    else:
        plt.xscale('symlog', linthresh=1000)
        plt.xlim(left=-1000)
    if y in ['FSC-A', 'SSC-A']:
        plt.yscale('linear')
    else:
        plt.yscale('symlog', linthresh=1000)
        plt.ylim(bottom=-1000)
    plt.savefig('{}-{}-{}.png'.format(file_name, x, y))
    plt.close()

<font color='orange'>**INPUT:** Load the model from a saved pickle object. Insert the model file name as shown below:</font>

*model_name = 'example.pkl'*

In [None]:
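model_name = 'Sample_model.pkl'
# Load the saved estimator; the with-block closes the file when done.
with open(model_name, 'rb') as model_file:
    model = pickle.load(model_file)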
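
The columns passed to the model at prediction time have to match the ones used for training. When a scikit-learn estimator is fitted on a DataFrame with string column names, those names are stored in its `feature_names_in_` attribute, which can be used to select and order the FCS columns. The helper `order_features` below is a hypothetical sketch of that idea, shown on plain lists so that it runs without a model. The channel names in the example are placeholders.

```python
def order_features(available, expected):
    """Return `expected` as a list after checking every name is available."""
    available = set(available)
    missing = [name for name in expected if name not in available]
    if missing:
        raise KeyError(f"columns missing from the FCS data: {missing}")
    return list(expected)


# Inside `analyze`, before predicting, this could be used as:
# total_events = total_events[order_features(total_events.columns,
#                                            model.feature_names_in_)]
fcs_columns = ["SSC-A", "FSC-A", "FITC-A", "PE-A", "Extra-A"]  # placeholder names
trained_on = ["FSC-A", "SSC-A", "FITC-A", "PE-A"]              # placeholder names
print(order_features(fcs_columns, trained_on))
```

Extra columns in the FCS file are simply left out, while a missing column raises an error that names it, which is easier to act on than a failure inside `predict`.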

<font color='orange'>**INPUT:** Analyze a single FCS file. Provide the file name, the number of figures you want, and the pair of parameters for each figure, as shown below:</font>

*file_name = 'example.fcs'*

*fig_num = 2*

*fig_params = [('FSC-A', 'SSC-A'), ('V500-C-A', 'SSC-A')]*

In [None]:
file_name = 'Sample_patient.fcs'
fig_num = 2
fig_params = [('FSC-A', 'SSC-A'), ('V500-C-A', 'SSC-A')]
analyze(file_name, model, fig_num, fig_params)
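
To analyze several patients with the same settings, the call above can be repeated for every FCS file in a folder. The sketch below only covers collecting the files. The helper `list_fcs_files` and the folder name "patients" in the commented usage are hypothetical. It matches the lower-case `.fcs` extension only, because `analyze` strips exactly that suffix when it builds output names.

```python
import tempfile
from pathlib import Path


def list_fcs_files(folder):
    """Return the .fcs files directly inside `folder`, sorted by name."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix == ".fcs")


# Hypothetical use with the notebook's own function and settings:
# for path in list_fcs_files("patients"):
#     analyze(str(path), model, fig_num, fig_params)

# Demonstration on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ["b.fcs", "a.fcs", "notes.txt"]:
        (Path(folder) / name).touch()
    found = [p.name for p in list_fcs_files(folder)]

print(found)  # ['a.fcs', 'b.fcs']
```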