<h1>Clustering with PhenoGraph</h1>

The PDMC samples have already been gated by a supervised algorithm, which provides one methodology for feature extraction. Clustering is another. It is often described as an 'unbiased' approach because events are grouped by their similarity in high-dimensional space, whereas gating is biased by the sequential selection of events in two-dimensional plots.

In [2]:
import sys
if '/home/rossco/immunova' not in sys.path:
    sys.path.append('/home/rossco/immunova')
from immunova.data.mongo_setup import pd_init
from immunova.data.fcs_experiments import FCSExperiment
from immunova.flow.supervised.cell_classifier import create_reference_sample
from immunova.flow.clustering.phenograph import PhenoGraph
from warnings import filterwarnings
from tqdm import tqdm_notebook
import matplotlib
import pandas as pd
import os
filterwarnings('ignore')
pd_init()

<h2>Clustering on a concatenated sample</h2>

Clustering can be performed on a per-sample basis, with clusters then matched between samples. This adds complexity, because a suitable method must be chosen for matching the clusters. According to the literature, QFMatch (an adaptation of the quadratic form distance metric) is reported to be the best-performing method. That said, it is a complicated method that has not been validated across multiple datasets. The original method, used in the PhenoGraph paper and replicated in multiple studies, is medoid meta-clustering with PhenoGraph. I will use both methods later on.
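As a rough illustration of the first step of medoid meta-clustering, the sketch below computes one medoid per cluster: the event with the smallest total Euclidean distance to the other events in its cluster. This is not the immunova implementation; the function name `cluster_medoids`, the marker columns and the toy values are all assumptions made for the example. In a full workflow the medoids from every sample would then be clustered again to form meta-clusters.

```python
import numpy as np
import pandas as pd


def cluster_medoids(data: pd.DataFrame, labels: pd.Series) -> pd.DataFrame:
    """Return one medoid (a real event) per cluster. Hypothetical helper."""
    medoids = {}
    for cluster_id, events in data.groupby(labels.to_numpy()):
        x = events.to_numpy(dtype=float)
        # Pairwise Euclidean distances between all events in this cluster.
        # Memory grows quadratically, so large clusters may need subsampling.
        dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        # The medoid is the event with the smallest summed distance to the rest
        medoids[cluster_id] = events.iloc[dist.sum(axis=1).argmin()]
    # Rows are cluster ids, columns are the original features
    return pd.DataFrame(medoids).T


# Toy data: two clusters of three events each (values are made up)
toy = pd.DataFrame({'CD4': [0.0, 1.0, 10.0, 5.0, 5.0, 5.0],
                    'CD8': [0.0, 0.0, 0.0, 5.0, 6.0, 7.0]})
toy_labels = pd.Series([0, 0, 0, 1, 1, 1])
toy_medoids = cluster_medoids(toy, toy_labels)
print(toy_medoids)
```

Unlike a centroid, a medoid is always an observed event, so it cannot land in an empty region between two dense groups of cells.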

First, I will take a concatenated sample and apply PhenoGraph clustering. The risk is that the clustering algorithm captures information that distinguishes patients from one another rather than global information relating to disease progression, cause or outcome. This first approach is purely exploratory.
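For context, a concatenated sample can be thought of as a fixed-size random draw from each patient stacked into one table. The sketch below shows that idea with plain pandas. It is an assumption about the general approach, not the code that produced `PD_T_PDMCs_sampled_data`; the function name `concatenate_samples`, the `sample_id` column and the toy patients are hypothetical.

```python
import pandas as pd


def concatenate_samples(samples: dict, n: int, seed: int = 42) -> pd.DataFrame:
    """Draw up to n events from each sample and stack them. Hypothetical helper."""
    sampled = []
    for sample_id, events in samples.items():
        # Never request more events than the sample contains
        k = min(n, len(events))
        subset = events.sample(n=k, random_state=seed).copy()
        # Keep the origin of every event so clusters can be audited later
        subset['sample_id'] = sample_id
        sampled.append(subset)
    return pd.concat(sampled, ignore_index=True)


# Toy data: one patient with ten events, one with only three
toy_samples = {
    'patient_a': pd.DataFrame({'CD4': range(10), 'CD8': range(10)}),
    'patient_b': pd.DataFrame({'CD4': range(3), 'CD8': range(3)}),
}
concatenated = concatenate_samples(toy_samples, n=5)
print(concatenated['sample_id'].value_counts())
```

Keeping the `sample_id` column makes it possible to cross-tabulate cluster labels against patients afterwards, for example with `pd.crosstab`, and so to spot clusters that are dominated by a single patient.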

In [3]:
texp = FCSExperiment.objects(experiment_id='PD_T_PDMCs').get()

In [4]:
concatenated_clustering = PhenoGraph(clustering_uid='PhenoGraph_071219', 
                                     experiment=texp, 
                                     sample_id='PD_T_PDMCs_sampled_data', 
                                     root_population='single_Live_CD3+')

In [4]:
from immunova.flow.gating.actions import Gating

In [5]:
g = Gating(texp, 'PD_T_PDMCs_sampled_data')

In [19]:
g.populations['nongdt'].children[1].children[0].children

()

In [13]:
g.populations.keys()

dict_keys(['single_Live_CD3+', 'gdt', 'nongdt', 'mait', 'classic', 'CD4+CD8+', 'CD4-CD8-', 'CD4+CD8-', 'CD4-CD8+', 'mait+CD4+CD8+', 'mait+CD4-CD8-', 'mait+CD4+CD8-', 'mait+CD4-CD8+'])

In [21]:
g.populations['single_Live_CD3+']



In [23]:
p = g.populations['single_Live_CD3+']

In [80]:
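def population_labels(data: pd.DataFrame, root_node) -> pd.DataFrame:
    """Label every event with the last (deepest) gated population visited that contains it."""
    def recursive_label(d, n):
        # Label the events in this node; descendants visited afterwards
        # overwrite the label for the events they contain
        mask = d.index.isin(n.index)
        d.loc[mask, 'population_label'] = n.name
        for c in n.children:
            recursive_label(d, c)
        return d
    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()
    data['population_label'] = root_node.name
    return recursive_label(data, root_node)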

In [81]:
d = g.get_population_df('single_Live_CD3+')
n = g.populations['single_Live_CD3+']

In [82]:
x = population_labels(d, n)

In [84]:
x.population_label.unique()

array(['CD4-CD8-', 'CD4+CD8+', 'CD4+CD8-', 'CD4-CD8+', 'mait+CD4+CD8+',
       'gdt', 'mait+CD4-CD8+', 'mait+CD4+CD8-', 'mait+CD4-CD8-'],
      dtype=object)
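Only terminal populations appear in the output above, which is consistent with each child overwriting the label written by its parent. The toy example below reproduces that behaviour on a hand-built tree. The `Node` class is a hypothetical stand-in that mimics only the three attributes the labelling function relies on (`name`, `index`, `children`); it is not the immunova population class, and `label_events` is a renamed copy of the same logic so that the block runs on its own.

```python
from dataclasses import dataclass, field
from typing import List

import pandas as pd


@dataclass
class Node:
    """Hypothetical stand-in for a gated population node."""
    name: str
    index: List[int]
    children: List['Node'] = field(default_factory=list)


def label_events(data: pd.DataFrame, root: Node) -> pd.DataFrame:
    """Label each event; descendants overwrite the labels of their ancestors."""
    def visit(d: pd.DataFrame, node: Node) -> None:
        d.loc[d.index.isin(node.index), 'population_label'] = node.name
        for child in node.children:
            visit(d, child)

    data = data.copy()
    data['population_label'] = root.name
    visit(data, root)
    return data


# Toy tree: root -> A -> A1, and root -> B. Event 5 belongs only to the root.
a1 = Node('A1', [0, 1])
a = Node('A', [0, 1, 2], [a1])
b = Node('B', [3, 4])
root = Node('root', [0, 1, 2, 3, 4, 5], [a, b])

events = pd.DataFrame({'CD4': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})
labelled = label_events(events, root)
print(labelled['population_label'].tolist())
```

Events that fall outside every child gate keep the label of their parent, so a parent name such as `A` or `root` appears in the result only when its children do not cover all of its events.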

In [71]:
p.children[1]

