<h1>Setup: selecting the training data, validation data, and labelling</h1>

We know that autonomous gating algorithms that operate in 2-dimensional space and are applied according to a gating strategy struggle to capture relevant cell populations in disease datasets where there is a lot of variation.

An alternative approach is to take a representative sample, label the cells in that sample according to our gating strategy using a traditional manual approach, and then use this labelled dataset to train a classifier. The theory here is that, despite the inter-sample variation, there exist biological signals (cell populations) that are shared across samples and are distinguishable in our high-dimensional expression data.

In this notebook I will, for each experiment, create training data and validation data and label them according to our standard gating strategy; that is, I will identify the major cell subsets of interest and label cells by these subsets.

Choosing training data is difficult, because what counts as a representative sample? To address this issue I will create two different sources of training data:
1. A single training sample, chosen by computing the pairwise distance between samples and selecting the sample with the smallest average distance to all others, i.e. the most central sample in a shared Euclidean space.
2. A concatenated 'global sample', created by sampling cells uniformly from each patient so that it contains cells from every patient. This should capture the variation from every patient.

The best-performing training method will be chosen based on its weighted F1 score on the validation data.
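
For reference, the weighted F1 score is the per-class F1 averaged with weights proportional to how many true cells belong to each class. The block below is a minimal pure-Python sketch of that definition, not the code used later in this notebook; the function name `weighted_f1` and the population labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true support."""
    support = Counter(y_true)  # number of true cells per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: cells of this class that were predicted as this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Hypothetical manual labels (validation) and classifier predictions
y_true = ["CD4", "CD4", "CD4", "CD8", "CD8", "gdT"]
y_pred = ["CD4", "CD4", "CD8", "CD8", "CD8", "CD4"]
score = weighted_f1(y_true, y_pred)
print(f"Weighted F1: {score:.3f}")
```

Weighting by support means abundant populations dominate the score, so a rare subset that is missed entirely (the 'gdT' cell here) lowers the score only in proportion to its size.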

In [2]:
import sys
if '/home/ross/immunova' not in sys.path:
    sys.path.append('/home/ross/immunova')
from immunova.data.mongo_setup import pd_init
from immunova.data.fcs_experiments import FCSExperiment
from immunova.flow.gating.actions import Gating
from immunova.flow.gating.defaults import ChildPopulationCollection
from immunova.flow.supervised.cell_classifier import create_reference_sample
from immunova.flow.supervised.utilities import calculate_ref_sample_fast
from warnings import filterwarnings
from tqdm import tqdm_notebook
import matplotlib
import pandas as pd
import os
filterwarnings('ignore')
pd_init()

In [3]:
# Load experiments
pdtexp = FCSExperiment.objects(experiment_id='PD_T_PDMCs').get()
pbtexp = FCSExperiment.objects(experiment_id='PD_T_PBMCs').get()
nexp = FCSExperiment.objects(experiment_id='PD_N_PDMCs').get()

<h2>Choosing the training data</h2>

<h3>The 'average' sample</h3>

Immunova has many utility functions. One of these selects the sample with the smallest average distance to all other samples in Euclidean space: for every pair of samples i, j, compute the Frobenius norm of the difference between their covariance matrices, and then select the sample with the smallest average distance to all other samples. This is performed as described in Li et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860171/).

There are two implementations in Immunova: `calculate_reference_sample` and `calculate_ref_sample_fast`. The only difference is that the latter uses multiprocessing for faster computation.
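
The selection rule described above can be sketched with NumPy as follows. This is an illustration under stated assumptions and not Immunova's implementation: the function name `pick_reference_sample`, the dictionary input format, and the toy data are all hypothetical, and it leaves out the per-sample downsampling that `sample_n` suggests.

```python
import numpy as np


def pick_reference_sample(samples):
    """Return the sample whose covariance matrix is, on average, closest to the others.

    samples: dict mapping sample name -> array of shape (n_cells, n_markers).
    Requires at least two samples.
    """
    names = list(samples)
    # One covariance matrix per sample (columns are markers)
    covs = [np.cov(samples[name], rowvar=False) for name in names]
    k = len(names)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            # Frobenius norm of the difference between covariance matrices
            d = np.linalg.norm(covs[i] - covs[j], ord="fro")
            dist[i, j] = dist[j, i] = d
    # Average distance to every other sample (diagonal is zero)
    mean_dist = dist.sum(axis=1) / (k - 1)
    return names[int(np.argmin(mean_dist))], dict(zip(names, mean_dist))


# Toy data: the same cells at three scales, so 'b' lies between 'a' and 'c'
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
toy = {"a": base * 1.0, "b": base * 2.0, "c": base * 3.0}
ref, distances = pick_reference_sample(toy)
print(f"Reference sample: {ref}")
```

In the toy data the covariance matrices are exact multiples of one another (1x, 4x and 9x), so the middle sample has the smallest average distance and is selected.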

In [4]:
exclude = [f'{x}_pdmc_t' for x in ['209-03', '210-14', '273-01', '298-01', '322-01', '237-06', '302-01']]
print(f'PDMCs T Panel ref sample: {calculate_ref_sample_fast(pdtexp, exclude_samples=exclude, sample_n=1000)}')

PDMCs T Panel ref sample: 288-02_pdmc_t


In [5]:
exclude = [f'{x}_pbmc_t' for x in ['305-01', '286-02']]
print(f'PBMCs T Panel ref sample: {calculate_ref_sample_fast(pbtexp, exclude_samples=exclude, sample_n=1000)}')

PBMCs T Panel ref sample: 305-02_pbmc_t


In [6]:
exclude = [f'{x}_pdmc_n' for x in ['209-05']]
print(f'PDMCs N Panel ref sample: {calculate_ref_sample_fast(nexp, exclude_samples=exclude, sample_n=1000)}')

PDMCs N Panel ref sample: 294-03_pdmc_n


<h3>Uniform sample</h3>

An alternative to the above is to take a uniform sample from every patient in an experiment and create a concatenated sample that is 'representative' of the variance seen across all samples. This can be achieved by using the `create_reference_sample` function from `flow.supervised.cell_classifier`. This function takes an experiment, the root population (e.g. if this is 'Live_CD3+', cells are sampled from the population named 'Live_CD3+' in every patient), a list of files that should be excluded, the name of the new file that is created, and how many/what proportion of cells to sample.
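
The idea behind the global sample can be sketched as below. This is a simplified illustration and not the body of `create_reference_sample`: the function name `global_uniform_sample` and the in-memory dictionary of populations are hypothetical, and it assumes that `sample_n` is a per-patient cell count and that every array already shares the same marker columns.

```python
import numpy as np


def global_uniform_sample(populations, sample_n, exclude=(), seed=42):
    """Uniformly sample up to sample_n cells per patient and concatenate them.

    populations: dict mapping sample name -> array of shape (n_cells, n_markers),
    i.e. the cells of the root population for that sample.
    """
    rng = np.random.default_rng(seed)
    sampled = []
    for name, cells in populations.items():
        if name in exclude:
            continue  # skip files listed for exclusion
        # Never request more cells than the population contains
        n = min(sample_n, cells.shape[0])
        idx = rng.choice(cells.shape[0], size=n, replace=False)
        sampled.append(cells[idx])
    return np.vstack(sampled)


# Toy populations with different cell counts and four markers each
toy_populations = {
    "p1": np.arange(200, dtype=float).reshape(50, 4),
    "p2": np.arange(80, dtype=float).reshape(20, 4),
    "p3": np.arange(400, dtype=float).reshape(100, 4),
}
global_sample = global_uniform_sample(toy_populations, sample_n=30, exclude=("p3",))
print(global_sample.shape)
```

Sampling without replacement and capping at the population size means small samples contribute all of their cells, so a fixed `sample_n` gives larger samples no more weight than the cap allows.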

In [7]:
exclude = [f'{x}_pdmc_t' for x in ['209-03', '210-14', '273-01', '298-01', '322-01', '237-06', '302-01']]
create_reference_sample(experiment=pdtexp,
                        root_population='single_Live_CD3+',
                        exclude=exclude,
                        new_file_name='Global_Uniform_Sample',
                        sampling_method='uniform',
                        sample_n=10000)

-------------------- Generating Reference Sample --------------------
Finding features common to all fcs files...
Sampling 142-09_pdmc_t...
Sampling 165-09_pdmc_t...
Sampling 175-09_pdmc_t...
Sampling 209-05_pdmc_t...
Sampling 239-02_pdmc_t...
Sampling 239-04_pdmc_t...
Sampling 251-07_pdmc_t...
Sampling 251-08_pdmc_t...
Sampling 254-04_pdmc_t...
Sampling 254-05_pdmc_t...
Sampling 255-04_pdmc_t...
Sampling 255-05_pdmc_t...
Sampling 264-02_pdmc_t...
Sampling 267-02_pdmc_t...
Sampling 276-01_pdmc_t...
Sampling 286-03_pdmc_t...
Sampling 286-04_pdmc_t...
Sampling 294-02_pdmc_t...
Sampling 294-03_pdmc_t...
Sampling 305-01_pdmc_t...
Sampling 305-03_pdmc_t...
Sampling 306-01_pdmc_t...
Sampling 308-02R_pdmc_t...
Sampling 308-03R_pdmc_t...
Sampling 308-04_pdmc_t...
Sampling 310-01_pdmc_t...
Sampling 315-01_pdmc_t...
Sampling 315-02_pdmc_t...
Sampling 318-01_pdmc_t...
Sampling 323-01_pdmc_t...
Sampling 324-01_pdmc_t...
Sampling 326-01_pdmc_t...
Sampling 267-01_pdmc_t...
Sampling 279-03_pdmc_t...


In [8]:
exclude = [f'{x}_pbmc_t' for x in ['305-01', '286-02']]
create_reference_sample(experiment=pbtexp,
                        root_population='single_Live_CD3+',
                        exclude=exclude,
                        new_file_name='Global_Uniform_Sample',
                        sampling_method='uniform',
                        sample_n=10000)

-------------------- Generating Reference Sample --------------------
Finding features common to all fcs files...
Sampling 142-09_pbmc_t...
Sampling 165-09_pbmc_t...
Sampling 175-09_pbmc_t...
Sampling 210-14_pbmc_t...
Sampling 239-02_pbmc_t...
Skipping 239-02_pbmc_t as single_Live_CD3+ is absent from gated populations
Sampling 239-04_pbmc_t...
Skipping 239-04_pbmc_t as single_Live_CD3+ is absent from gated populations
Sampling 251-08_pbmc_t...
Sampling 254-04_pbmc_t...
Sampling 254-05_pbmc_t...
Sampling 255-04_pbmc_t...
Sampling 255-05_pbmc_t...
Sampling 264-02_pbmc_t...
Sampling 273-01_pbmc_t...
Sampling 276-01_pbmc_t...
Sampling 286-03_pbmc_t...
Sampling 286-04_pbmc_t...
Sampling 294-02_pbmc_t...
Sampling 294-03_pbmc_t...
Sampling 298-01_pbmc_t...
Sampling 305-02_pbmc_t...
Sampling 305-03_pbmc_t...
Sampling 306-01_pbmc_t...
Sampling 308-01_pbmc_t...
Sampling 308-02R_pbmc_t...
Sampling 308-03R_pbmc_t...
Sampling 308-04_pbmc_t...
Sampling 310-01_pbmc_t...
Sampling 315-01_pbmc_t...
Samp

In [9]:
exclude = [f'{x}_pdmc_n' for x in ['209-05']]
create_reference_sample(experiment=nexp,
                        root_population='Single_Live_CD45+',
                        exclude=exclude,
                        new_file_name='Global_Uniform_Sample',
                        sampling_method='uniform',
                        sample_n=10000)

-------------------- Generating Reference Sample --------------------
Finding features common to all fcs files...
Sampling 142-09_pdmc_n...
Sampling 210-14_pdmc_n...
Sampling 239-02_pdmc_n...
Sampling 239-04_pdmc_n...
Sampling 251-07_pdmc_n...
Sampling 251-08_pdmc_n...
Sampling 254-04_pdmc_n...
Sampling 254-05_pdmc_n...
Sampling 255-04_pdmc_n...
Sampling 255-05_pdmc_n...
Sampling 264-02_pdmc_n...
Sampling 267-02_pdmc_n...
Sampling 273-01_pdmc_n...
Sampling 276-01_pdmc_n...
Sampling 286-03_pdmc_n...
Sampling 286-04_pdmc_n...
Sampling 294-02_pdmc_n...
Sampling 294-03_pdmc_n...
Sampling 298-01_pdmc_n...
Sampling 305-01_pdmc_n...
Sampling 305-02_pdmc_n...
Sampling 305-03_pdmc_n...
Sampling 306-01_pdmc_n...
Sampling 308-01_pdmc_n...
Sampling 308-02R_pdmc_n...
Sampling 308-03R_pdmc_n...
Sampling 310-01_pdmc_n...
Sampling 315-01_pdmc_n...
Sampling 315-02_pdmc_n...
Sampling 318-01_pdmc_n...
Sampling 320-01_pdmc_n...
Sampling 321-01_pdmc_n...
Sampling 322-01_pdmc_n...
Sampling 323-01_pdmc_n...
