## Metric for comparing generated patient samples with real patient samples

<img name="patient comparator" src="prComparator.png" width=300 height=300> </img>    

The number of patients in either the original or the generated data should not affect the metric.

## Definitions 

<table>
    <tr>
        <td>$Symbol$ </td>
        <td>$Definition$</td>
    </tr>
    <tr>
        <td>$con$</td>
        <td>control patient group</td>
    </tr>
    <tr>
        <td>$dis$</td>
        <td>diseased patient group</td>
    </tr>
    <tr>
        <td>$gen$</td>
        <td>generated patients</td>
    </tr>
    <tr>
        <td>$ori$</td>
        <td>original patients</td>
    </tr>
    <tr>
        <td>$g \in \{control, diseased\}$</td>
        <td>patient group</td>
    </tr>
    <tr>
        <td>$t \in \{original, generated\}$ </td>
        <td>data type</td>  
    </tr>
    <tr>
        <td>$c^j_i$</td>
        <td>cell number $j$ of patient $i$</td>
    </tr>
    <tr>
        <td>$\sigma_i$ </td>
        <td>group and type $(g,t)$ of patient $i$</td>
    </tr>
    <tr>
        <td>$P_i = \{ c^1_i, c^2_i, \cdots, c^n_i\}$</td>
        <td> cells of patient $i$</td>
    </tr>
    <tr>
        <td>$U^g$</td>
        <td>UMAP embedding trained on all cells of group $g$ of the $original$ patients</td>
    </tr>
    <tr>
        <td>$M_i = U^g(P_i)$</td>
        <td> cells of patient $i$ embedded using $U^g$, where $(g,t) \leftarrow \sigma_i$</td>
    </tr>
    <tr>
        <td>$G_{i,j}$</td>
        <td>a mesh grid ranging from min($M_i$, $M_j$) to max($M_i$, $M_j$)</td>
    </tr>
    <tr>
        <td>$p_i = p(M_i)$</td>
        <td>kernel density estimate over the embedded cells of patient $i$</td>
    </tr>
    <tr>
        <td>$KL_p(P_i, P_j)$</td>
        <td>the Kullback-Leibler divergence between two patients, measured over a mesh grid $G_{i,j}$ (so that both densities are evaluated over the same space) using $p_i$ for $P_i$ and $p_j$ for $P_j$ <br>the subscript $p$ indicates that this measures the difference between two patients</td>
    </tr>
    <tr>
        <td>$S_\theta$</td>
        <td>the set of all patients from group $g$ and data type $t$, where $[g,t] \leftarrow \theta$</td>
    </tr>
    <tr>
        <td>$H(\theta_1,\theta_2)$</td>
        <td>$\{KL_p(P_i, P_j) | P_i \in S_{\theta_1} \wedge P_j\in S_{\theta_2}\}$</td>
    </tr>
    <tr>
        <td>$D(\theta_1,\theta_2)$</td>
        <td>the distribution of $H(\theta_1,\theta_2)$ estimated using kernel density estimation </td>
    </tr>
    <tr>
        <td>$\Sigma_{\theta_1,\theta_2}$</td>
        <td>the sum of $H(\theta_1,\theta_2)$</td>
    </tr>
</table>
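
The definitions of $G_{i,j}$, $p_i$ and $KL_p$ can be illustrated with a small sketch. It is not the notebook's implementation: random 2-D points stand in for the UMAP-embedded cells $M_i$, a hand-rolled isotropic Gaussian KDE stands in for the kernel density estimator so that only NumPy is needed, and the names `gaussian_kde_2d`, `shared_grid`, `kl_patient`, the bandwidth, the grid resolution and the `eps` smoothing are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_2d(points, grid, bandwidth=0.5):
    """Evaluate an isotropic Gaussian KDE fitted on `points` (n, 2) at `grid` (m, 2)."""
    diff = grid[:, None, :] - points[None, :, :]      # shape (m, n, 2)
    sq_dist = np.sum(diff ** 2, axis=-1)              # shape (m, n)
    kernel = np.exp(-sq_dist / (2 * bandwidth ** 2)) / (2 * np.pi * bandwidth ** 2)
    return kernel.mean(axis=1)

def shared_grid(m_i, m_j, steps=40):
    """Mesh grid G_{i,j} spanning the joint min and max of both embedded patients."""
    both = np.vstack([m_i, m_j])
    lo, hi = both.min(axis=0), both.max(axis=0)
    xs = np.linspace(lo[0], hi[0], steps)
    ys = np.linspace(lo[1], hi[1], steps)
    xx, yy = np.meshgrid(xs, ys)
    return np.column_stack([xx.ravel(), yy.ravel()])

def kl_patient(m_i, m_j, eps=1e-12):
    """Non-symmetric KL between two patients, with both densities on the same grid."""
    grid = shared_grid(m_i, m_j)
    p = gaussian_kde_2d(m_i, grid)
    q = gaussian_kde_2d(m_j, grid)
    p = p / p.sum()   # normalise to discrete distributions over the grid
    q = q / q.sum()
    # eps guards against log(0); it is a choice of this sketch, not of the source
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(0)
m_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # stand-in for M_a
m_b = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # drawn like m_a
m_c = rng.normal(loc=3.0, scale=1.0, size=(200, 2))   # shifted away from m_a

kl_same = kl_patient(m_a, m_a)
kl_close = kl_patient(m_a, m_b)
kl_far = kl_patient(m_a, m_c)
```

In this toy setting, a patient compared with itself gives a divergence of zero, and the shifted patient is further from `m_a` than the patient drawn from the same distribution.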

## We consider a generative model to be good when
1. <i style="color:brown">$D(\textit{[con,ori]},\textit{[dis,ori]})$</i> is as similar to <i style="color:brown">$D(\textit{[con,gen]},\textit{[dis,gen]})$</i> as possible. These are indicated in figure 1 by the arrows of the corresponding color.
2. <i style="color:green">$\Sigma_{\textit{[con,ori]},\textit{[con,gen]}}$</i> is as small as possible
3. <i style="color:green">$\Sigma_{\textit{[dis,ori]},\textit{[dis,gen]}}$</i> is as small as possible
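
Because the number of patients should not matter, it may help to see how a plain sum of pairwise divergences behaves as the number of pairs grows, compared with a point statistic such as the mean or median. The sketch below uses made-up divergences drawn from an exponential distribution; the values, the group sizes and the helper name `summarise` are assumptions for illustration only, not results from the data.

```python
import numpy as np

rng = np.random.default_rng(0)

def summarise(divergences):
    """Return the sum, mean and median of a list of pairwise divergences."""
    d = np.asarray(divergences, dtype=float)
    return {"sum": d.sum(), "mean": d.mean(), "median": np.median(d)}

# Hypothetical H(theta_1, theta_2) for 5 x 5 and for 20 x 20 patient pairs,
# drawn from the same distribution so that only the number of pairs differs.
small = rng.exponential(scale=1.0, size=5 * 5)
large = rng.exponential(scale=1.0, size=20 * 20)

stats_small = summarise(small)
stats_large = summarise(large)
```

With these toy values the sum scales with the number of pairs, while the mean and median stay on a comparable scale for both group sizes.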

## The proposed metric
Thus a simple formula for measuring the performance of a generative model (how good the generated patients are) can be:

$$Loss(model) = KL_s( D(\textit{[con,ori]},\textit{[dis,ori]}), D(\textit{[con,gen]},\textit{[dis,gen]}) ) + \Sigma_{\textit{[con,ori]},\textit{[con,gen]}} + \Sigma_{\textit{[dis,ori]},\textit{[dis,gen]}}$$

The KL divergence $KL(P|Q)$ is used because it can be thought of as the information lost when $Q$ is used to approximate $P$. Here, $KL_s(P|Q)$ is the symmetric KL divergence.
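
As a small illustration of why a symmetric variant is needed, the sketch below computes the KL divergence in both directions for two discrete distributions. The text does not say which symmetrisation is meant, so taking the sum of both directions is an assumption here, as are the function names `kl` and `kl_symmetric` and the `eps` smoothing term.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Discrete KL(P | Q) for two non-negative vectors on the same support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    # eps keeps the logarithm finite where either vector has no mass (a choice of this sketch)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def kl_symmetric(p, q):
    """Assumed symmetrisation: the sum of both directions."""
    return kl(p, q) + kl(q, p)

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]

forward = kl(p, q)            # information lost when q approximates p
backward = kl(q, p)           # information lost when p approximates q
symmetric = kl_symmetric(p, q)
```

The two directions generally differ, whereas the symmetric version gives the same value regardless of argument order.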

In [None]:
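def H(h1, h2):
    """Pairwise divergences between every patient in h1 and every patient in h2."""
    div = []
    for i in h1:
        for j in h2:
            # KL_patient (KL_p in the definitions) is assumed to be defined elsewhere
            div.append(KL_patient(i, j))
    return div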

In [None]:
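import numpy as np
from sklearn.neighbors import KernelDensity

def KLc(d1, d2):
    """Symmetric KL divergence between two outputs of D, evaluated on a shared 1-D grid."""
    density1, h1 = d1
    density2, h2 = d2
    # Shared grid spanning both sets of divergences, padded by 1 on each side.
    low = np.min([np.min(h1), np.min(h2)]) - 1
    high = np.max([np.max(h1), np.max(h2)]) + 1
    # np.linspace replaces the undefined meshgrid helper; 1000 points is an arbitrary choice.
    X = np.linspace(low, high, 1000).reshape(-1, 1)

    # score_samples returns log-densities, which is what KL expects.
    log_p = density1.score_samples(X)
    log_q = density2.score_samples(X)
    # Assumption: the symmetric divergence is taken as the sum of both directions.
    return KL(log_p, log_q) + KL(log_q, log_p)

def D(t1, t2):
    h = H(t1, t2)
    # KernelDensity expects a 2-D array of shape (n_samples, n_features).
    kde = KernelDensity(kernel="linear", bandwidth=1).fit(np.reshape(h, (-1, 1)))
    return kde, h

def S(t1, t2):
    return np.sum(H(t1, t2))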

In [None]:
def model_loss():
    co = [] # TODO list of control original patients
    do = [] # TODO list of diseased original patients
    cg = [] # TODO list of control generated patients
    dg = [] # TODO list of diseased generated patients

    sim = KLc(D(co,do), D(cg,dg))
    sim1 = S(co, cg)  # [con,ori] vs [con,gen]
    sim2 = S(do, dg)  # [dis,ori] vs [dis,gen]
    return sim + sim1 + sim2

## Need an online version of the density estimator for the variational autoencoder.

In [None]:
import numpy as np

def KL(a, b):
    """
    PARAMETERS
    ----------
    a : numpy array
        log-densities of the first distribution (as returned by score_samples)
    b : numpy array
        log-densities of the second distribution on the same grid

    RETURNS
    -------
    kl divergence : float
        the nonsymmetric Kullback-Leibler divergence between a and b.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Only grid points where both log-densities are finite contribute;
    # -inf marks zero density and is skipped rather than exponentiated.
    cond = np.isfinite(a) & np.isfinite(b)

    # p * log(p / q) written in terms of the log-densities a and b
    return np.sum(np.exp(a[cond]) * (a[cond] - b[cond]))

### Questions

#### Master
1. Should I use the mean or another point statistic for criteria 2 and 3 to make the metric invariant to the number of patients? (The median is more stable, because the mean is affected by outliers.)
2. The bottleneck is still computing the KL divergence.
    - Use a mesh grid generated from the original data, padded by ±10% of its span, and evaluate the density on that grid once per patient rather than once per comparison. Write up the reasoning.
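
The idea of a single padded grid with one density evaluation per patient can be sketched as follows. This is only an illustration under assumptions: random 2-D points stand in for embedded patients, a simple Gaussian kernel written in NumPy stands in for the density estimator, and the helper names `padded_grid`, `density_on_grid` and `kl_from_cached`, the bandwidth and the grid resolution are all hypothetical.

```python
import numpy as np

def padded_grid(original_points, steps=30, pad=0.10):
    """Mesh grid over the original data, extended by `pad` times the span on each side."""
    lo = original_points.min(axis=0)
    hi = original_points.max(axis=0)
    span = hi - lo
    axes = [np.linspace(l - pad * s, h + pad * s, steps) for l, h, s in zip(lo, hi, span)]
    mesh = np.meshgrid(*axes)
    return np.column_stack([m.ravel() for m in mesh])

def density_on_grid(points, grid, bandwidth=0.5):
    """Gaussian kernel density of `points` on `grid`, normalised to sum to 1."""
    sq_dist = np.sum((grid[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    dens = np.exp(-sq_dist / (2 * bandwidth ** 2)).mean(axis=1)
    return dens / dens.sum()

def kl_from_cached(p, q, eps=1e-12):
    """KL divergence from two densities already evaluated on the same grid."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(1)
patients = [rng.normal(loc=mu, size=(100, 2)) for mu in (0.0, 0.5, 1.0)]

grid = padded_grid(np.vstack(patients))                  # built once from all original cells
cached = [density_on_grid(m, grid) for m in patients]    # one evaluation per patient
pairwise = np.array([[kl_from_cached(p, q) for q in cached] for p in cached])
```

With n patients this needs n density evaluations instead of one pair of evaluations for each of the n² comparisons; the pairwise step then reduces to array arithmetic on the cached densities.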


#### Project
1. Testing is difficult because the code necessarily involves randomness.
2. When I create an example, should I
    - add a printed version on GitHub, or
    - make it a runnable program included in the library, i.e. a main function that can be called?
    

### TODO
1. Implement the double-date metric
2. Out of the selected models (GMM, VAE, GANs, NaiveBayesian), find the best model using the double-date metric.