# TASK 5.2 Compare generated and original data

## Goal: The double-date metric
- Generated control cells/patients should be more similar to original control cells/patients than to original diseased cells/patients. This score **should be high**.
- Generated diseased cells/patients should be more similar to original diseased cells/patients than to original control cells/patients. This score **should be high**.
- The difference between generated control and generated diseased cells/patients should be as close as possible to the difference between original control and original diseased cells/patients. The gap between the two differences **should be low**.

## Process old
1. Train a generator on each group separately, using the original data.
2. Embed the original data of each group separately -> take each patient separately.
3. Generate patients -> embed each patient using the same embedding space as the patient's respective group.
4. Get the KL divergence between each pair of patients.
    1. The KL divergence is measured on a meshgrid spanning the union of the two patients' cells.
    2. Fit two kernel density estimators, one on each patient's embedding. Then use the KL divergence to compare the two density estimates over the meshgrid.
5. Use the double-date metric to compare generators.


## Process new
1. Train a generator on each group separately, using the original data.
2. Embed the original data of both groups in one shared embedding space -> take each patient separately.
3. Generate patients -> embed each patient using the same embedding space.
4. Get the KL divergence between each pair of patients.
    1. The KL divergence is measured on a meshgrid spanning the union of the two patients' cells.
    2. Fit two kernel density estimators, one on each patient's embedding. Then use the KL divergence to compare the two density estimates over the meshgrid.
5. Use the double-date metric to compare generators.
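
The shared-space idea in steps 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: a plain PCA projection (computed with NumPy's SVD) stands in for t-SNE/UMAP, and the helper names `fit_shared_space` and `project` as well as the synthetic data are hypothetical, not part of this notebook.

```python
import numpy as np

def fit_shared_space(cells, n_components=2):
    """Fit ONE PCA projection on the cells of all original patients (both groups)."""
    mean = cells.mean(axis=0)
    # rows of vt are the principal directions of the centered data
    _, _, vt = np.linalg.svd(cells - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(cells, space):
    """Project cells into an already fitted space, without refitting."""
    mean, components = space
    return (cells - mean) @ components.T

rng = np.random.default_rng(0)
# synthetic stand-ins for real data: 5 marker columns per cell
org_control = rng.normal(0.0, 1.0, size=(300, 5))
org_diseased = rng.normal(1.0, 1.0, size=(300, 5))
gen_control = rng.normal(0.1, 1.0, size=(100, 5))

# the space is fitted once, on control and diseased cells together
space = fit_shared_space(np.vstack([org_control, org_diseased]))
emb_org_control = project(org_control, space)
emb_gen_control = project(gen_control, space)
print(emb_org_control.shape, emb_gen_control.shape)
```

Because every patient, original or generated, passes through the same `project` call, distances between groups are no longer inflated by unrelated coordinate systems.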



## Criticism
1. Factors that might affect the metric
    1. The embedding algorithm. Properties of the chosen algorithm carry over to the metric. **Embedding the groups in different spaces already makes the difference between them large.**
    2. The way the KL divergence is used. The span of the meshgrid varies from pair to pair.
    3. The kernel density estimator. Different parameters might give different results.
    4. The generators themselves will vary with their hyperparameters. (These are the models we want to validate.)
    5. The number of generated patients might affect the metric.
    6. The double-date metric might not be a good metric in itself.
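
Points 1.2 and 1.3 can be checked empirically with a small experiment like the one below. It is a sketch under assumptions, not the notebook's implementation: it uses a hand-written Gaussian kernel in NumPy instead of scikit-learn's linear kernel, it normalizes each density over the grid, and the names `kde_on_grid` and `symmetric_grid_kl` and the synthetic patients are hypothetical.

```python
import numpy as np

def kde_on_grid(points, grid, bandwidth):
    """Gaussian kernel density on the grid, normalized to sum to 1 over the grid."""
    # squared distance between every grid node and every cell: (n_grid, n_cells)
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    dens = np.exp(-0.5 * d2 / bandwidth ** 2).mean(axis=1)
    return dens / dens.sum()

def symmetric_grid_kl(p_points, q_points, half_span, bandwidth, n=50):
    """Symmetric KL between two 2-D point clouds on an n x n grid."""
    axis = np.linspace(-half_span, half_span, n)
    grid = np.stack(np.meshgrid(axis, axis), -1).reshape(-1, 2)
    p = kde_on_grid(p_points, grid, bandwidth)
    q = kde_on_grid(q_points, grid, bandwidth)
    mask = (p > 0) & (q > 0)  # skip nodes where a density underflowed to 0
    p, q = p[mask], q[mask]
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
patient_a = rng.normal(0.0, 1.0, size=(200, 2))  # synthetic embedded cells
patient_b = rng.normal(1.5, 1.0, size=(200, 2))

results = {}
for half_span in (6.0, 10.0):
    for bandwidth in (1.0, 2.0):
        kl = symmetric_grid_kl(patient_a, patient_b, half_span, bandwidth)
        results[(half_span, bandwidth)] = kl
        print(f"half_span={half_span:>4}  bandwidth={bandwidth}  KL={kl:.3f}")
```

The same two patients receive different scores depending on the bandwidth and the grid span, so both should be fixed, or at least reported, when generators are compared.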

### 1. Train a generator on each group separately, using the original data.

In [11]:
import pickle
import numpy as np

In [23]:
import pandas as pd

In [21]:
# load generated data 
# or load some generator 
# or train a generator - do this in another file
def load_patients(file):
    """
    PARAMETERS
    ----------
    file: csv with cells of patients.
    
    RETURNS
    -------
    df: Dataframe version of the csv file
    
    """
    return pd.read_csv(file)


def load_generator(file):
    pass # TODO


def train_generator():
    pass # TODO

### 2. Embed each group separately -> take each patient separately (old process)

In [15]:
from openTSNE import TSNE
from openTSNE.callbacks import ErrorLogger

In [1]:
import umap

In [36]:
def umap_embedd(df, embedders=None):
    """
    PARAMETERS
    ----------
    df: dataframe with cells of patients
    embedders: dict of fitted umap.UMAP objects, one for each group of patients.
               If None, a new embedder is fitted for each group.
    
    RETURNS
    -------
    embedders: dict of fitted umap embedders, one for each group of patients
    
    group_embedding: dict with the embedding of each group of df
    
    """
    
    group_embedding = dict()
    groups = ["control", "diseased"]
    
    fit_new = embedders is None
    if fit_new:
        embedders = {"control": umap.UMAP(n_components=2, n_neighbors=10, random_state=0),
                    "diseased": umap.UMAP(n_components=2, n_neighbors=10, random_state=0)}

    for g in groups:
        group = df.groupby("group").get_group(g)
        group_cells = group[group.columns.difference(["id","group"])]
        if fit_new:
            # fit a new embedding space on this group
            group_embedding[g] = embedders[g].fit_transform(group_cells)
        else:
            # project into the existing embedding space of this group
            group_embedding[g] = embedders[g].transform(group_cells)
    return embedders, group_embedding
    
    
    
    
def tsne_embedd(df, embedders=None):  
    """
    PARAMETERS
    ----------
    df: dataframe with cells of patients
    embedders: dict of openTSNE TSNEEmbedding objects, one for each group of patients.
               If None, a new embedding is fitted for each group.
    
    RETURNS
    -------
    embedders: dict of fitted TSNEEmbedding objects, one for each group of patients
    
    group_embedding: dict with the embedding of each group of df
    
    """
    group_embedding = dict()
    groups = ["control", "diseased"]
    if embedders is None:
        tsne = {"control": TSNE(n_components=2, perplexity=30, 
                    learning_rate=20, n_iter=1000, 
                    random_state=0),
                    "diseased": TSNE(n_components=2, perplexity=30, 
                    learning_rate=20, n_iter=1000, 
                    random_state=0)}
        embedders = dict()
        
        for g in groups:
            group = df.groupby("group").get_group(g)
            group_cells = group[group.columns.difference(["id","group"])]
            # TSNE.fit returns a TSNEEmbedding; only that object provides transform(),
            # so it is what has to be returned for later reuse
            embedders[g] = tsne[g].fit(group_cells.values)
            group_embedding[g] = embedders[g]
    else:
        for g in groups:
            group = df.groupby("group").get_group(g)
            group_cells = group[group.columns.difference(["id","group"])]
            # project new cells into the existing embedding of this group
            group_embedding[g] = embedders[g].transform(group_cells.values)
            
        
    return embedders, group_embedding
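
The embedding functions above return one array per group, while the KL step compares individual patients. A possible way to split a group embedding into per-patient arrays is sketched below. The helper `split_by_patient` is hypothetical, and it assumes that the `id` column holds the patient identifier and that the rows of the embedding are in the same order as the rows of the group dataframe (so that `group["id"].to_numpy()` can be passed as `patient_ids`).

```python
import numpy as np

def split_by_patient(embedding, patient_ids):
    """Return a list with one (n_cells, 2) array per patient.

    Patients are ordered by first appearance in patient_ids.
    """
    embedding = np.asarray(embedding)
    patient_ids = np.asarray(patient_ids)
    # np.unique sorts the ids, so recover the order of first appearance
    _, first_idx = np.unique(patient_ids, return_index=True)
    ordered_ids = patient_ids[np.sort(first_idx)]
    return [embedding[patient_ids == pid] for pid in ordered_ids]

# toy embedding of 6 cells belonging to 3 patients
embedding = np.arange(12, dtype=float).reshape(6, 2)
patient_ids = np.array(["p2", "p2", "p1", "p1", "p1", "p3"])
patients = split_by_patient(embedding, patient_ids)
print([p.shape for p in patients])
```

The resulting lists, one for `"control"` and one for `"diseased"`, have the per-patient layout that the pairwise KL step indexes with `[i]`.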

### 3. Get the KL divergence between each pair of patients.

In [None]:
from sklearn.neighbors import KernelDensity

In [None]:
def get_kl_between_pairs(original, generated, n_patients=20):
    """
    PARAMETERS
    ----------
    original: a dictionary with keys "control" and "diseased"; each value holds
              the embedded cells of the original patients, one array per patient
    
    generated: a dictionary with keys "control" and "diseased"; each value holds
               the embedded cells of the generated patients, one array per patient
    
    n_patients: number of patients per group to compare
    
    RETURNS
    -------
    pairwise_kl: a list of four matrices [cc, dd, cd, dc]. Entry [i, j] is the KL
                 divergence between generated patient i and original patient j.
    
    """
    def KL(log_p, log_q):
        # inputs are log-densities as returned by KernelDensity.score_samples
        log_p = np.asarray(log_p, dtype=float)
        log_q = np.asarray(log_q, dtype=float)
        # exp(-inf) = 0: grid points outside the kernel support get density 0
        p = np.exp(log_p)
        q = np.exp(log_q)

        # only sum over grid points where both densities are positive
        cond = np.logical_and(p > 0, q > 0)
    
        return np.sum(p[cond] * (log_p[cond] - log_q[cond]))
    
    def meshgrid(min_value = -15,max_value=15):
        # 120 x 120 grid of 2-D points, flattened to shape (14400, 2)
        grid_margins = [np.linspace(min_value, max_value, 120)] * 2
        grid = np.stack(np.meshgrid(*grid_margins), -1).reshape(-1, len(grid_margins))
        return grid

    def KL_between_two_patients(p1, p2, symmetric=True):
        # the grid spans the union of both patients' cells, plus a margin of 1
        max_1 = np.max(p1)
        min_1 = np.min(p1)
        max_2 = np.max(p2)
        min_2 = np.min(p2)
        X = meshgrid(min_value = np.min([min_1, min_2])-1, max_value = np.max([max_1,max_2])+1 )

        P_k_density_estimator = KernelDensity(kernel = "linear", bandwidth=1).fit(p1)
        P_sample_score = P_k_density_estimator.score_samples(X)

        Q_k_density_estimator = KernelDensity(kernel = "linear", bandwidth=1).fit(p2)
        Q_sample_score = Q_k_density_estimator.score_samples(X)
        kl_score = 0

        if(symmetric):
            kl_score = KL(P_sample_score, Q_sample_score) + KL(Q_sample_score, P_sample_score)
        else:
            kl_score = KL(P_sample_score, Q_sample_score)

        return kl_score
    
    
    
    KL_sum_cc = np.empty((n_patients,n_patients))
    KL_sum_dd = np.empty((n_patients,n_patients))
    KL_sum_cd = np.empty((n_patients,n_patients))
    KL_sum_dc = np.empty((n_patients,n_patients))
    for i in range(n_patients):
        for j in range(n_patients):# Get the KL divergence between each pair of generated (i) and original (j) patient
            KL_sum_cc[i,j] = KL_between_two_patients(generated["control"][i], original["control"][j])
            KL_sum_dd[i,j] = KL_between_two_patients(generated["diseased"][i], original["diseased"][j])
            KL_sum_cd[i,j] = KL_between_two_patients(generated["diseased"][i], original["control"][j])
            KL_sum_dc[i,j] = KL_between_two_patients(generated["control"][i], original["diseased"][j])

    return [KL_sum_cc, KL_sum_dd, KL_sum_cd, KL_sum_dc]

### 4. Double-date metric
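
The notebook does not yet define a formula for the double-date metric, so the following is only one possible reading of the three goals stated at the top. It assumes the index convention of `get_kl_between_pairs` (rows are generated patients, columns are original patients). The function name `double_date_scores` and the two extra inputs `kl_gen_cd` (generated control vs. generated diseased) and `kl_org_cd` (original control vs. original diseased) are hypothetical; those two matrices are not produced anywhere in this notebook.

```python
import numpy as np

def double_date_scores(kl_cc, kl_dd, kl_cd, kl_dc, kl_gen_cd=None, kl_org_cd=None):
    """One possible reading of the three double-date goals.

    kl_cc: generated control  vs. original control
    kl_dd: generated diseased vs. original diseased
    kl_cd: generated diseased vs. original control
    kl_dc: generated control  vs. original diseased
    """
    # goal 1: share of generated controls closer to original controls (high is good)
    control_score = float(np.mean(kl_cc.mean(axis=1) < kl_dc.mean(axis=1)))
    # goal 2: share of generated diseased closer to original diseased (high is good)
    diseased_score = float(np.mean(kl_dd.mean(axis=1) < kl_cd.mean(axis=1)))
    # goal 3: gap between the generated and the original group difference (low is good)
    gap = None
    if kl_gen_cd is not None and kl_org_cd is not None:
        gap = float(abs(np.mean(kl_gen_cd) - np.mean(kl_org_cd)))
    return control_score, diseased_score, gap

# toy matrices for 3 generated and 3 original patients per group
kl_cc = np.full((3, 3), 1.0)
kl_dc = np.full((3, 3), 3.0)
kl_dd = np.array([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [4.0, 4.0, 4.0]])
kl_cd = np.full((3, 3), 3.0)
scores = double_date_scores(kl_cc, kl_dd, kl_cd, kl_dc,
                            kl_gen_cd=np.full((3, 3), 2.5),
                            kl_org_cd=np.full((3, 3), 3.0))
print(scores)
```

Averaging over original patients before comparing is a design choice; a per-pair comparison or a median would be equally defensible and may rank generators differently.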

### 5. Visualization of the pairwise KL-divergence

In [19]:
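import matplotlib.pyplot as plt
import seaborn as sns

def plot_heatmap(KL_sum_cc, KL_sum_dd, KL_sum_cd, KL_sum_dc):
    # rows: generated patients, columns: original patients
    # (index convention of get_kl_between_pairs)
    ab = np.concatenate((KL_sum_cc,KL_sum_dc),axis=1)  # generated control
    cd = np.concatenate((KL_sum_cd,KL_sum_dd),axis=1)  # generated diseased
    abcd = np.concatenate((ab,cd),axis=0)
    
    plt.figure(figsize=(16,16))
    sns.set()
    ax = sns.heatmap(abcd)
    ax.set_xlabel("original (left: control, right: diseased)")
    ax.set_ylabel("generated (top: control, bottom: diseased)")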

In [20]:
def plot_boxplot(data, title):
    # data: list of KL matrices, one per panel
    # title: list of panel titles, same length as data
    fig, ax = plt.subplots(1, len(title), sharex=True, sharey=True)
    fig.set_figheight(10)
    fig.set_figwidth(18)
    for i, t in enumerate(title):
        sns.boxplot(data=data[i], ax=ax[i])
        ax[i].set_title(t)
    ax[0].set_ylabel('average KL divergence')
    fig.subplots_adjust(wspace=0)

# Main

In [None]:
#1 load original and generated data
df_org = load_patients("original_patients.csv") # TODO
df_gen = load_patients("generated_patients.csv") # TODO

#2 Embed data TODO: turn group_embeddings into dataframes
tsne_org, group_embedding_tsne_org = tsne_embedd(df_org)
tsne_gen, group_embedding_tsne_gen = tsne_embedd(df_gen, tsne_org)

umap_org, group_embedding_umap_org = umap_embedd(df_org)
umap_gen, group_embedding_umap_gen = umap_embedd(df_gen, umap_org)

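#3 get kl_div
# TODO: compute KL_sum_cc_8, KL_sum_cd_8, KL_sum_dc_8 and KL_sum_dd_8 with
# get_kl_between_pairs once the group embeddings are split per patient.

#5 plot the pairwise KL divergences
title_8 = ['CC_8', 'CD_8', 'DC_8', 'DD_8']
data_8 = [KL_sum_cc_8, KL_sum_cd_8, KL_sum_dc_8, KL_sum_dd_8]
plot_boxplot(data_8, title_8)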