# TASK 5.2 Compare generated and original data

## Goal: The double-date metric
- Generated control cells/patients should be more similar to original control cells/patients than to original diseased cells/patients.
- Generated diseased cells/patients should be more similar to original diseased cells/patients than to original control cells/patients.
- The difference between generated control and generated diseased cells/patients should be as close as possible to the difference between original control and original diseased cells/patients.
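
One possible way to write these three conditions down, as a sketch only: let $D(A, B)$ be some average dissimilarity between two sets of patients (for example a mean pairwise KL-divergence), and let $G_c, G_d, O_c, O_d$ stand for generated control, generated diseased, original control and original diseased. All of these symbols are introduced here for illustration and are not fixed by the task.

$$
\begin{aligned}
D(G_c, O_c) &< D(G_c, O_d) \\
D(G_d, O_d) &< D(G_d, O_c) \\
\bigl|\, D(G_c, G_d) - D(O_c, O_d) \,\bigr| &\to \min
\end{aligned}
$$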

## Process
1. Train a generator on each group separately, using the original data.
2. Embed the original data of each group separately, then split the embedding by patient.
3. Generate patients, then embed each generated patient in the embedding space of that patient's group.
4. Compute the KL-divergence between each pair of patients.
    1. The KL-divergence is measured on a meshgrid spanning the union of the two patients' embedded cells.
    2. Fit two kernel density estimators, one on each patient's embedding, then use the KL-divergence to compare the two density estimates over the meshgrid.
5. Use the double-date metric to compare generators.
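
Step 4 can be read as a grid approximation of the KL integral. As a sketch, write $\hat{p}$ and $\hat{q}$ for the two kernel density estimates, $X$ for the set of meshgrid points and $\Delta$ for the area of one grid cell (these symbols are assumptions made for this note):

$$
D_{\mathrm{KL}}(P \,\|\, Q) \approx \sum_{x \in X} \hat{p}(x)\,\log\frac{\hat{p}(x)}{\hat{q}(x)}\,\Delta,
\qquad
D_{\mathrm{sym}}(P, Q) = D_{\mathrm{KL}}(P \,\|\, Q) + D_{\mathrm{KL}}(Q \,\|\, P)
$$

Grid points where either estimate is zero have to be handled separately, because the summand is undefined there. Leaving out $\Delta$ rescales the sum by a factor that depends on the span of the grid.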


## Criticism
1. Factors that might affect the metric
    1. The embedding algorithm. Properties of the chosen algorithm will carry over to the metric.
    2. The way KL-divergence is used. The span of the meshgrid varies from pair to pair.
    3. The kernel density estimator. Different parameters (kernel, bandwidth) might give different results.
    4. The generators themselves will vary with hyperparameters. (These are the models we want to validate.)
    5. The number of generated patients might affect the metric.
    6. The double-date metric might not be a good metric in itself.
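
To illustrate the point about the meshgrid span, here is a small sketch that uses two analytic 2-D Gaussians instead of kernel density estimates, so it needs only NumPy. The helper names `gaussian_pdf_2d` and `grid_kl` are made up for this example. With a fixed number of grid points, the plain sum changes when the span changes, whereas the sum weighted by the cell area stays close to the analytic value of 0.5 for these two Gaussians.

```python
import numpy as np


def gaussian_pdf_2d(points, mean, sigma=1.0):
    """Isotropic 2-D Gaussian density evaluated at each row of `points`."""
    sq_dist = np.sum((points - mean) ** 2, axis=1)
    return np.exp(-sq_dist / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)


def grid_kl(span, n=120, weight_by_area=False):
    """Sum p * log(p / q) over an n x n grid covering [-span, span]^2."""
    axis = np.linspace(-span, span, n)
    grid = np.stack(np.meshgrid(axis, axis), -1).reshape(-1, 2)
    p = gaussian_pdf_2d(grid, mean=np.array([0.0, 0.0]))
    q = gaussian_pdf_2d(grid, mean=np.array([1.0, 0.0]))
    total = np.sum(p * np.log(p / q))
    if weight_by_area:
        # multiply by the area of one grid cell to approximate the integral
        total *= (axis[1] - axis[0]) ** 2
    return total


for span in (6.0, 12.0):
    print(span, grid_kl(span), grid_kl(span, weight_by_area=True))
```

If pairs of patients are compared on grids of different spans, weighting by the cell area is one way to keep the scores comparable across pairs.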

### 1. Train a generator on each group separately on the original data.

In [11]:
import pickle
import numpy as np
import pandas as pd  # needed by load_generated_patients

In [14]:
# load generated data 
# or load some generator 
# or train a generator - do this in another file
def load_generated_patients(file):
    """
    PARAMETERS
    ----------
    file: csv of generated cells for different patients.
    """
    return pd.read_csv(file) 


def load_generator(file):
    pass # TODO


def train_generator():
    pass # TODO
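
As a possible starting point for the `load_generator` stub, the sketch below assumes the generator was saved with `pickle` (which this notebook already imports) and that the generator object is picklable. Both are assumptions, and the function names `save_generator` and `load_pickled_generator` are hypothetical. A plain dictionary stands in for a real generator in the round trip.

```python
import os
import pickle
import tempfile


def save_generator(generator, path):
    """Write a generator object to `path` with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(generator, handle)


def load_pickled_generator(path):
    """Read back an object written by save_generator."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# round trip with a stand-in object (not a real generator)
stand_in = {"group": "control", "latent_dim": 8}
with tempfile.TemporaryDirectory() as tmp_dir:
    generator_path = os.path.join(tmp_dir, "generator.pkl")
    save_generator(stand_in, generator_path)
    restored = load_pickled_generator(generator_path)
```

Unpickling can execute arbitrary code, so only files from a trusted source should be loaded this way.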

### 2. Embed each group separately, then take each patient separately

In [15]:
from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger

In [None]:
def umap_embedd(df, embedder=None):
    """
    PARAMETERS
    ----------
    df: dataframe with cells of patients
    embedder: umap embedding with existing embedded space.
    
    RETURNS
    -------
    embedder: umap embedder with existing embedded space
    
    embedded: embedding of df using embedder
    
    """
    
    group_embedding = dict()
    
    if embedder is None:
        embedder = umap.UMAP(n_components=2, n_neighbors=10, random_state=0)

    groups = ["control", "diseased"]
    for g in groups:
        group = df[df.group == g]
        group_cells = group[group.columns.difference(["id","group"])]
        # NOTE: fit_transform refits the embedder for every group, so the
        # returned embedder only holds the space of the last group ("diseased")
        group_embedding[g] = embedder.fit_transform(group_cells)
    return embedder, group_embedding
    
    
    
    
def tsne_embedd(df, embedder=None):  
    """
    PARAMETERS
    ---------
    df: dataframe with cells of patients
    embedder: OpenTSNE - TSNEEmbedding object
    
    RETURNS
    -------
    embedder: tsne embedder with existing embedded space
    
    embedded: embedding of df using embedder
    
    """
    group_embedding = dict()
    groups = ["control", "diseased"]
    if embedder is None:
        # no existing embedding: fit a new t-SNE embedding for each group
        embedder = TSNE(n_components=2, perplexity=30, 
                    learning_rate=20, n_iter=1000, 
                    random_state=0)
        
        for g in groups:
            group = df[df.group == g]
            group_cells = group[group.columns.difference(["id","group"])]
            # TSNE.fit returns a TSNEEmbedding for this group
            group_embedding[g] = embedder.fit(group_cells.values)
    else:
        # existing TSNEEmbedding: project the new cells into its space
        for g in groups:
            group = df[df.group == g]
            group_cells = group[group.columns.difference(["id","group"])]
            group_embedding[g] = embedder.transform(group_cells.values)
            
        
    return embedder, group_embedding
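
Both embedding functions return one array per group, while the later steps compare individual patients. The sketch below shows one way to split a group embedding by patient. It assumes that the rows of the embedding are in the same order as the rows of the group's dataframe and that the `id` column identifies the patient. The function name `split_embedding_by_patient` is hypothetical.

```python
import numpy as np


def split_embedding_by_patient(embedding, patient_ids):
    """Return {patient_id: array of that patient's embedded cells}."""
    embedding = np.asarray(embedding)
    patient_ids = np.asarray(patient_ids)
    if len(embedding) != len(patient_ids):
        raise ValueError("embedding and patient_ids must have the same length")
    return {
        pid: embedding[patient_ids == pid]
        for pid in np.unique(patient_ids).tolist()
    }


# toy data: five embedded cells belonging to three patients
toy_embedding = np.arange(10).reshape(5, 2)
toy_ids = ["a", "b", "a", "c", "b"]
per_patient = split_embedding_by_patient(toy_embedding, toy_ids)
```

With the dataframe used above, the ids for one group would be something like `df[df.group == g]["id"].values`.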

In [1]:
import umap

### 3. Get the KL-divergence between each pair of patients.

In [None]:
from sklearn.neighbors import KernelDensity

In [None]:
def get_kl_between_pairs(original, generated):
    """
    PARAMETERS
    ----------
    original: a dictionary containing the embedded control and diseased groups of the original patients
    
    generated: a dictionary containing the embedded control and diseased groups of the generated patients
    
    RETURNS
    -------
    pairwise_kl: a list of matrices. Each matrix contains the KL-divergences between two groups.
    
    """
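    def KL(a,b):
        # a, b: log-densities from KernelDensity.score_samples on the same grid
        a = np.exp(np.asarray(a, dtype=float))  # exp(-inf) == 0
        b = np.exp(np.asarray(b, dtype=float))

        # only grid points where both densities are positive contribute
        cond = np.logical_and(a > 0, b > 0)

        # substitute 1 where cond is False so log and division stay finite
        safe_a = np.where(cond, a, 1.0)
        safe_b = np.where(cond, b, 1.0)
        return np.sum(np.where(cond, a * np.log(safe_a / safe_b), 0.0))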
    
    def meshgrid(min_value = -15,max_value=15):
        grid_margins = [np.linspace(min_value, max_value, 120)] * 2
        grid = np.stack(np.meshgrid(*grid_margins), -1).reshape(-1, len(grid_margins))
        return grid

    def KL_between_two_patients(p1, p2, symmetric=True):
        max_1 = np.max(p1)
        min_1 = np.min(p1)
        max_2 = np.max(p2)
        min_2 = np.min(p2)
        X = meshgrid(min_value = np.min([min_1, min_2])-1, max_value = np.max([max_1,max_2])+1 )

        P_k_density_estimator = KernelDensity(kernel = "linear", bandwidth=1).fit(p1)
        P_sample_score = P_k_density_estimator.score_samples(X)

        Q_k_density_estimator = KernelDensity(kernel = "linear", bandwidth=1).fit(p2)
        Q_sample_score = Q_k_density_estimator.score_samples(X)
        kl_score = 0

        if(symmetric):
            kl_score = KL(P_sample_score, Q_sample_score) + KL(Q_sample_score, P_sample_score)
        else:
            kl_score = KL(P_sample_score, Q_sample_score)

        return kl_score
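
The docstring promises pairwise matrices, but the function stops after defining its helpers. The sketch below shows one way the final step could look: a generic helper that fills a matrix from any two-argument divergence. The name `pairwise_matrix` is hypothetical, and `centroid_distance` is only a stand-in so that the example runs without scikit-learn. In the notebook, `KL_between_two_patients` would take its place.

```python
import numpy as np


def pairwise_matrix(patients_a, patients_b, divergence):
    """matrix[i, j] = divergence(patients_a[i], patients_b[j])."""
    matrix = np.empty((len(patients_a), len(patients_b)))
    for i, p in enumerate(patients_a):
        for j, q in enumerate(patients_b):
            matrix[i, j] = divergence(p, q)
    return matrix


def centroid_distance(p, q):
    """Stand-in divergence: Euclidean distance between the cell centroids."""
    return float(np.linalg.norm(np.mean(p, axis=0) - np.mean(q, axis=0)))


# toy data: two "original" patients and one "generated" patient
originals = [np.zeros((4, 2)), np.ones((3, 2))]
generated = [np.full((5, 2), 2.0)]
result = pairwise_matrix(originals, generated, centroid_distance)
```

One such matrix per pair of groups (for example original control against generated control) would give the list of matrices described in the docstring.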
    

### 4. Double-date metric
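
This section is still empty. As a rough sketch of how the three goals stated at the top could be checked, the function below averages each pairwise KL matrix and compares the averages. Averaging is an assumption (a median or another summary might suit the data better). The dictionary keys and the name `double_date_summary` are hypothetical, and the numbers are toy values rather than results.

```python
import numpy as np


def double_date_summary(kl):
    """Check the three double-date goals on mean pairwise divergences.

    `kl` maps (group_a, group_b) name pairs to pairwise KL matrices.
    """
    mean = {pair: float(np.mean(matrix)) for pair, matrix in kl.items()}
    return {
        # generated control closer to original control than to original diseased
        "control_ok": mean[("gen_control", "orig_control")]
        < mean[("gen_control", "orig_diseased")],
        # generated diseased closer to original diseased than to original control
        "diseased_ok": mean[("gen_diseased", "orig_diseased")]
        < mean[("gen_diseased", "orig_control")],
        # how far the generated control/diseased gap is from the original gap
        "gap_error": abs(
            mean[("gen_control", "gen_diseased")]
            - mean[("orig_control", "orig_diseased")]
        ),
    }


# toy matrices, not real results
toy_kl = {
    ("gen_control", "orig_control"): np.array([[0.2, 0.4]]),
    ("gen_control", "orig_diseased"): np.array([[1.0, 1.2]]),
    ("gen_diseased", "orig_diseased"): np.array([[0.5, 0.3]]),
    ("gen_diseased", "orig_control"): np.array([[0.9, 1.1]]),
    ("gen_control", "gen_diseased"): np.array([[1.0]]),
    ("orig_control", "orig_diseased"): np.array([[1.25]]),
}
summary = double_date_summary(toy_kl)
```

Two generators could then be compared by checking that both boolean conditions hold and preferring the smaller `gap_error`.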

### 5. Visualization of the pairwise KL-divergence