# Using triku with non-integer datasets

Triku's key step is the convolution of reads. Convolution can be applied to either discrete or continuous distributions. However, numpy's convolution function (`np.convolve`) takes a numeric vector in which each step represents one unit. When a dataset is continuous we cannot convolve it directly, because the vector needed to represent every value on the real axis would be infinite. We can, however, approximate the continuous convolution by discretizing it, that is, by binning the continuous values into discrete categories and applying the convolution to them.

The binning step is controlled by the `n_divisions` parameter, which sets the number of parts into which each unit is divided. For example, if `n_divisions` is 4, all continuous values between X.0 and X.25 are gathered into the same bin, those between X.25 and X.5 into the next one, and so on. This gives read count and kNN count arrays with discrete values (the first 4 bins span 0 to 1, the next 4 span 1 to 2, and so on). Because the unit has been divided into discrete subunits, we can now apply the convolution as usual; the only difference is that, at the end, we restore the original scale of the x axis by dividing it by `n_divisions`.
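
The discretization described above can be sketched in a few lines. This is only an illustration under stated assumptions, not triku's actual implementation: the function names `discretized_pmf` and `convolve_discretized` are hypothetical, and flooring is assumed as the binning rule (so that values between X.0 and X.25 share a bin when `n_divisions` is 4).

```python
import numpy as np

def discretized_pmf(values, n_divisions):
    """Bin values into subunits of width 1/n_divisions (assumed rule: floor)
    and return the fraction of values that falls into each bin."""
    scaled = np.asarray(values, dtype=float) * n_divisions
    # The small epsilon guards against float round-off such as 0.29 * 100 = 28.999...
    bins = np.floor(scaled + 1e-9).astype(int)
    counts = np.bincount(bins)
    return counts / counts.sum()

def convolve_discretized(values, n_divisions, k):
    """Convolve the discretized distribution with itself k times and
    return the x axis rescaled back to the original units."""
    pmf = discretized_pmf(values, n_divisions)
    conv = pmf
    for _ in range(k - 1):
        conv = np.convolve(conv, pmf)
    x = np.arange(len(conv)) / n_divisions  # restore the original scale
    return x, conv

# Toy values (made up for illustration): halves are exact with n_divisions=2.
values = [0.0, 0.5, 0.5, 1.0]
x, conv = convolve_discretized(values, n_divisions=2, k=2)
```

The x axis returned here is expressed in the original units, which is what allows results obtained with different `n_divisions` to be compared directly.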

**Why is this important?** 
Some datasets do not have discrete read counts. For example, the `alevin` read mapper takes into account reads that map to different isoforms or different genes. Each such read is split among all candidates, so we may get non-integer read counts.
Without discretization, triku would not be able to apply the convolution and would fail to produce any distance. With `n_divisions`, it is able to do so.

Also, some datasets might already be log transformed, in which case it is highly likely that no value is an integer. In those cases we could round each value and apply triku, but the distortions would be huge.


**Purpose**
In this notebook we are going to apply triku to 3 10X datasets (neuron, pbmc, heart), which have been mapped with alevin. We are going to test triku in three conditions:
* Integer data: we make sure that selected features and wasserstein distances are the same regardless of `n_divisions`.
* Original data: some counts will not be integers (0.5, 1.333, etc.). We will run triku with several values of `n_divisions` and compare the distances obtained with 100 divisions against those obtained with each of the other options. To do that, we will calculate the sum of the absolute differences over all genes and divide it by the number of genes, $\frac{1}{n}\sum |a_{100} - a_X|$. We consider 100 divisions to be a good approximation of continuity, and we expect the mean absolute distance to increase as `n_divisions` gets smaller.
* Log-transformed data: we will log transform (log(x + 1)) each dataset and do the same as in the previous case. Here the dataset will have a wider range of continuous values, so convergence will require higher `n_divisions`.
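
As a minimal sketch of the comparison metric, the mean absolute distance against the reference column can be computed as follows. The helper name `mean_abs_distance` and the toy values are made up for illustration; only the `div_{n}` column naming follows the tables exported later in this notebook.

```python
import numpy as np
import pandas as pd

def mean_abs_distance(df, n_div, ref_div=100):
    """(1/n) * sum |a_ref - a_X| over all genes (rows) of df."""
    diff = df[f"div_{n_div}"].to_numpy() - df[f"div_{ref_div}"].to_numpy()
    return float(np.mean(np.abs(diff)))

# Toy distances for three hypothetical genes (not real results).
toy = pd.DataFrame(
    {
        "div_1": [0.30, 0.10, 0.05],
        "div_10": [0.22, 0.11, 0.05],
        "div_100": [0.20, 0.10, 0.05],
    },
    index=["gene_a", "gene_b", "gene_c"],
)

for n_div in (1, 10, 100):
    print(n_div, mean_abs_distance(toy, n_div))
```

By construction the metric is 0 for the reference column itself, so any value above 0 reflects only the effect of using a coarser grid.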

To avoid any random effect, we have set 5 different seeds, and we will do comparisons between results of the same seed. In this way we make sure that the only differing effect is the number of divisions. Take into account that PCA and NN algorithms must yield the same results because the discretization step is performed at the convolution step, which requires PCA and NN to be done beforehand.

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
import triku as tk
import scanpy as sc
import pandas as pd
import numpy as np
import scipy.sparse as spr
import scipy.stats as sts
import os
import gc
from itertools import product
import pickle
import ray

from tqdm.notebook import tqdm

from bokeh.io import show, output_notebook, reset_output
from bokeh.plotting import figure
from bokeh.models import LinearColorMapper

import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.lines import Line2D

from sklearn.metrics import adjusted_rand_score as ARS
from sklearn.metrics import adjusted_mutual_info_score as NMI
from sklearn.metrics import silhouette_score, davies_bouldin_score

reset_output()
output_notebook()

In [None]:
!python setup.py install

In [None]:
os.makedirs(os.getcwd() + '/exports/continuous/', exist_ok=True)

In [None]:
from triku_nb_code.palettes_and_cmaps import magma, bold_and_vivid, prism

In [None]:
%matplotlib inline

In [None]:
_10x_dir = os.path.dirname(os.getcwd()) + '/data/10x/FASTQs/'

In [None]:
list_n_divisions = [1, 2, 3, 5, 7, 10, 15, 20, 40, 50, 75, 100]

In [None]:
tissues = ['neuron', 'pbmc', 'heart']

In [None]:
a = np.array([True, True, False])
b = np.array([False, True, False])

np.sum(a & b)/np.sum(a | b)

In [None]:
def make_dist_fig(option, title, n_feats = 500):
    fig, axs = plt.subplots(1, len(tissues), figsize=(2.5*len(tissues), 2.5))
    for tissue_i, tissue in enumerate(tissues):
        ax2 = axs[tissue_i].twinx() 
        
        for seed in range(5):
            df = pd.read_csv(os.getcwd() + f'/exports/continuous/{tissue}_seed-{seed}_emd-distance_{option}.csv')
            list_distances_divs = []
            list_jaccard = []
            
            thresh_last = np.sort(df[f'div_{list_n_divisions[-1]}'].values)[-n_feats]
            
            for x in list_n_divisions:
                list_distances_divs.append(np.mean(np.abs(df[f'div_{x}'].values - df[f'div_{list_n_divisions[-1]}'].values)))
                
                thresh_x = np.sort(df[f'div_{x}'].values)[-n_feats]
                
                greater_x, greater_last = df[f'div_{x}'].values > thresh_x, df[f'div_{list_n_divisions[-1]}'].values > thresh_last
                jac = np.sum(greater_x & greater_last)/np.sum(greater_x | greater_last)
                list_jaccard.append(jac)
                
            axs[tissue_i].plot(list_n_divisions[:], list_distances_divs[:], c="#606060")
            ax2.plot(list_n_divisions[:], list_jaccard[:], c="firebrick", alpha = 0.3)

        axs[tissue_i].tick_params(axis='y', colors="#606060")
        ax2.tick_params(axis='y', colors='firebrick')
        ax2.set_ylim([0.45, 1.05])
        ax2.set_yticks([0.5, 0.6, 0.7, 0.8, 0.9, 1]); ax2.set_yticklabels([0.5, 0.6, 0.7, 0.8, 0.9, 1])
        
        for ax in [ax2, axs[tissue_i]]:
            ax.set_xscale('log')
            ax.set_xticks(list_n_divisions[::2])
            ax.set_xticklabels(list_n_divisions[::2])
            
        
        axs[tissue_i].set_title(tissue)
#     fig.suptitle(title, y=1.05)
    axs[0].set_ylabel('Mean absolute distance', c="#606060")
    ax2.set_ylabel(f'Jaccard index (of {n_feats} features)', c="firebrick")
    
    plt.tight_layout()
    os.makedirs(os.getcwd() + '/figures/continuous_figs/', exist_ok=True)
    plt.savefig(os.getcwd() + f'/figures/continuous_figs/comparison_{option}.pdf')     

In [None]:
for tissue in tissues:
    adata = sc.read_h5ad(_10x_dir + f'alevin_output_{tissue}/{tissue}_10k_v3_filtered_feature_bc_matrix.h5')
    adata.var_names_make_unique()
    adata.X = adata.X.astype(int)
    sc.pp.filter_cells(adata, min_counts=400)
    sc.pp.filter_genes(adata, min_counts=100)
    
    
    for seed in range(5):
#         if os.path.exists(os.getcwd() + f'/exports/continuous/{tissue}_seed-{seed}_emd-distance_int.csv'):
#             continue  
            
        df = pd.DataFrame(columns=[f'div_{i}' for i in list_n_divisions], 
                          index=adata.var_names)
        for n_divisions in list_n_divisions:
            print(tissue, n_divisions, seed)
            tk.tl.triku(adata, apply_background_correction=False, n_divisions=n_divisions, n_procs=8, random_state=seed)
            df.loc[:, f'div_{n_divisions}'] = adata.var['triku_distance'].values
            
        
        df.to_csv(os.getcwd() + f'/exports/continuous/{tissue}_seed-{seed}_emd-distance_int.csv') 

In [None]:
make_dist_fig(option='int', title='Mean absolute distance (integer)')

We see that the distance variation is minimal (on the order of $10^{-10}$), that is, the convolution is stable and yields the same values when the dataset contains only integers. There is, however, a slight decrease in distance with higher `n_divisions`, which might be due to small variations in how the convolution is calculated. These differences are negligible and can be considered 0.

In [None]:
for tissue in ['neuron', 'pbmc', 'heart']:
    adata = sc.read_h5ad(_10x_dir + f'alevin_output_{tissue}/{tissue}_10k_v3_filtered_feature_bc_matrix.h5')
    adata.var_names_make_unique()
    sc.pp.filter_cells(adata, min_counts=400)
    sc.pp.filter_genes(adata, min_counts=100)
    
    for seed in range(5):
#         if os.path.exists(os.getcwd() + f'/exports/continuous/{tissue}_seed-{seed}_emd-distance_float.csv'):
#             continue        
        df = pd.DataFrame(columns=[f'div_{i}' for i in list_n_divisions], 
                          index=adata.var_names)
        for n_divisions in list_n_divisions:
            print(tissue, n_divisions, seed)
            tk.tl.triku(adata, apply_background_correction=False, n_divisions=n_divisions, n_procs=8, random_state=seed)
            df.loc[:, f'div_{n_divisions}'] = adata.var['triku_distance'].values
        
        df.to_csv(os.getcwd() + f'/exports/continuous/{tissue}_seed-{seed}_emd-distance_float.csv')             

In [None]:
make_dist_fig(option='float', title='Mean absolute distance')

The mean distance ranges from 0.05 to 0.1 at the beginning, that is, without continuous convolution there is a mean difference of less than a tenth per gene. This is expected because a certain number of genes will have 1/2, 1/3 or 2/3 (or other fractions) of a read mapped to them. When `n_divisions` increases to 2 or 3, the distance decreases sharply because many of those fractional values now fall exactly on a bin boundary, and the distance for these cases is now 0. The slight increase in distance at `n_divisions`=15 occurs because, 15 being odd, values such as 1/2, 1/4, etc. are not exact in the convolution with that number of divisions. The same phenomenon occurs with `n_divisions`=7, where only multiples of 1/7 (1/7, 2/7, etc.) are exact.

Therefore, if `n_divisions` has to be small, the best options are 6 (a multiple of 2 and 3) or, better, 12 (a multiple of 2, 3, 4, and 6), since these are the most likely denominators when a read is split among isoforms. Nonetheless, at higher values of `n_divisions` the distances plateau.
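
Which fractional counts fall exactly on a grid of step 1/`n_divisions` can be checked with exact arithmetic. The sketch below uses Python's standard `fractions` module; the helper names `is_exact` and `n_exact` and the list of "common" fractions are assumptions chosen for illustration.

```python
from fractions import Fraction

def is_exact(value, n_divisions):
    """True if value lies exactly on the grid of step 1/n_divisions."""
    return (Fraction(value) * n_divisions).denominator == 1

# Fractions assumed to be frequent when a read is split among 2, 3 or 4 candidates.
common = [Fraction(1, 2), Fraction(1, 3), Fraction(2, 3), Fraction(1, 4), Fraction(3, 4)]

def n_exact(n_divisions):
    """Number of the common fractions represented without binning error."""
    return sum(is_exact(f, n_divisions) for f in common)

for n in (6, 7, 12, 15):
    exact = [str(f) for f in common if is_exact(f, n)]
    print(f"n_divisions={n}: {n_exact(n)} of {len(common)} exact {exact}")
```

A value of `n_divisions` that is a common multiple of the expected denominators leaves none of these fractions to be approximated, which is why small but highly divisible values behave well.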

In [None]:
for tissue in ['neuron', 'pbmc', 'heart']:
    adata = sc.read_h5ad(_10x_dir + f'alevin_output_{tissue}/{tissue}_10k_v3_filtered_feature_bc_matrix.h5')
    adata.var_names_make_unique()
    sc.pp.filter_cells(adata, min_counts=400)
    sc.pp.filter_genes(adata, min_counts=100)
    sc.pp.log1p(adata)
    
    for seed in range(5):
#         if os.path.exists(os.getcwd() + f'/exports/continuous/{tissue}_seed-{seed}_emd-distance_log.csv'):
#             continue  
            
        df = pd.DataFrame(columns=[f'div_{i}' for i in list_n_divisions], 
                          index=adata.var_names)
        for n_divisions in list_n_divisions:
            print(tissue, n_divisions, seed)
            tk.tl.triku(adata, apply_background_correction=False, n_divisions=n_divisions, n_procs=8, random_state=seed)
            df.loc[:, f'div_{n_divisions}'] = adata.var['triku_distance'].values
        
        df.to_csv(os.getcwd() + f'/exports/continuous/{tissue}_seed-{seed}_emd-distance_log.csv')     

In [None]:
make_dist_fig(option='log', title='Mean absolute distance (logarithm)')

In this case, the log transformation of integer and fractional counts yields floating-point values that are much more varied than fractional counts alone. Therefore, distances decrease without notable peaks, unlike in the previous case. Here, a value of `n_divisions` around 15 - 20 (12 can be a wise choice too) should yield sufficiently small differences for the selected genes to be the same.
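
To give an intuition of why log-transformed data behave more smoothly, the sketch below measures the largest error introduced by binning log(x + 1) values. It assumes flooring as the binning rule and uses hypothetical integer counts from 0 to 49; the helper `max_binning_error` is not part of triku.

```python
import numpy as np

counts = np.arange(0, 50)   # hypothetical raw counts
logged = np.log1p(counts)   # log(x + 1), almost never on a regular grid

def max_binning_error(values, n_divisions):
    """Largest gap between a value and the lower edge of its bin
    (assumed binning rule: floor to a multiple of 1/n_divisions)."""
    binned = np.floor(values * n_divisions) / n_divisions
    return float(np.max(values - binned))

for n in (1, 5, 12, 20):
    print(n, max_binning_error(logged, n))
```

Under this rule the error is always below 1/`n_divisions`, so it shrinks steadily as the grid gets finer instead of dropping to 0 at particular values of `n_divisions`.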