# Triku stability measures

In this notebook we calculate stability and robustness measures: we vary certain triku parameters and check how much the set of selected genes changes.

The main parameters to take into account are the **number of PCA components** and the **number of nearest neighbours (kNN)**, since these two steps account for most of the computation time. By default, the number of PCA components is set to 25 and kNN is set to $0.5\sqrt{n_{cells}}$. For large datasets (> 20k cells), processing can take a couple of minutes (assuming parallel processing; otherwise it takes ~4 minutes), mainly because of the calculation of the PCA and the kNN indices. Also, calculating distances with the **randomized matrix** doubles the run time, because all steps have to be repeated on the randomized matrix.
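As a quick illustration of how the default kNN value scales with dataset size, here is a minimal sketch. `default_knn` is a hypothetical helper written for this notebook, not a triku function, and the rounding rule is an assumption.

```python
import math


def default_knn(n_cells: int, factor: float = 0.5) -> int:
    """Hypothetical helper: k = factor * sqrt(n_cells).

    Rounding to the nearest integer and keeping k >= 1 are assumptions;
    triku may round differently.
    """
    return max(1, round(factor * math.sqrt(n_cells)))


for n_cells in (1_000, 10_000, 20_000):
    print(n_cells, default_knn(n_cells))
```

The number of neighbours grows with the square root of the number of cells, so a 20-fold larger dataset needs only about 4.5 times more neighbours.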

Thus, in this step we are going to see how each of those variables affects the selected genes. To do that we will use a set of benchmarking datasets by Mereu et al. The datasets are PBMCs from human and colon cells from mouse, processed with different library preparation methods: Chromium, CEL-seq, SMART-seq2, QUARTZ-seq, InDrop, ddSEQ, and snChromium. These datasets let us study stability not only across library preparation methods, but also across two organisms and tissues. We will also include artificial datasets to provide a ground truth when necessary.

A second dataset, from Ding et al., provides a similar benchmark and is also included here. It applies Smart-seq2, CEL-seq2, 10x Chromium (V2 and V3), Drop-seq, Seq-well, inDrops, and sci-RNAseq. The two benchmarks share 10x, Smart-seq2, CEL-seq (not the same version), and inDrops. We should therefore have a fair set of datasets to compare.

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
%matplotlib inline

In [None]:
import triku as tk
import scanpy as sc
import pandas as pd
import numpy as np

from bokeh.io import show, output_notebook, reset_output
from bokeh.plotting import figure
from bokeh.models import LinearColorMapper

import matplotlib.pyplot as plt
import matplotlib as mpl

from tqdm.notebook import tqdm
from itertools import product

reset_output()
output_notebook()

In [None]:
seed = 0

In [None]:
import sys, os
sys.path.insert(0, os.getcwd() + '/code')

# Selection of palettes for cluster coloring, and scatter values
from triku_nb_code.palettes_and_cmaps import magma, bold_and_vivid
from triku_nb_code.robustness_functions import (
    run_all_batches, run_batch, random_noise_parameter, plot_scatter_parameter,
    compare_parameter, get_all_pics_dataset, plot_scatter_datasets,
)


In [None]:
os.makedirs(os.getcwd() + '/exports/robustness/', exist_ok=True)

In [None]:
data_dir = os.path.dirname(os.getcwd()) + '/data/'

# csv generation
To analyse all conditions, we first run every combination of parameters and datasets and save the results as csv files; the analysis itself is done later on those files.
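A minimal sketch of what generating the combinations could look like with `itertools.product`. The parameter grids below are hypothetical placeholders; the values actually used live in `robustness_functions.py` and are not reproduced here.

```python
from itertools import product

# Hypothetical grids, for illustration only
knn_factors = [0.5, 1, 2, 5]        # multiples of sqrt(N)
seeds = [0, 1, 2]
lib_preps = ['10X', 'SMARTseq2']
orgs = ['human', 'mouse']

# One dictionary per run: every library prep x organism x kNN factor x seed
runs = [
    {'lib_prep': lib_prep, 'org': org, 'knn_factor': knn_factor, 'seed': seed}
    for lib_prep, org, knn_factor, seed in product(lib_preps, orgs, knn_factors, seeds)
]

print(len(runs))  # 2 * 2 * 4 * 3 = 48
```

Each dictionary would then correspond to one triku run whose per-gene distances are saved for the comparisons below.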

In [None]:
save_dir = os.getcwd() + '/exports/robustness/'
read_dir = os.getcwd() + '/data/Mereu_2020/'

lib_preps = ['SingleNuclei', 'Dropseq', 'inDrop', '10X', 'SMARTseq2', 'CELseq2', 'QUARTZseq']
orgs = ['mouse', 'human'] 

run_all_batches(lib_preps, orgs, 'mereu', read_dir, save_dir) # Uncomment to run!

In [None]:
save_dir = os.getcwd() + '/exports/robustness/'
read_dir = os.getcwd() + '/data/Ding_2020/'

lib_preps = ['10X', 'CELseq2', 'Dropseq', 'inDrop', 'sci-RNAseq', 'Seq-Well', 'SingleNuclei', 'SMARTseq2']
orgs = ['human', 'mouse']

run_all_batches(lib_preps, orgs, 'ding', read_dir, save_dir)  # Uncomment to run!

# Robustness between different parameter values
In this section we are going to compare the percentage of overlapping features across different parameter values. As in the previous section, we are going to fix:
* the number of kNN at $\sqrt{N}$ and the number of windows at 100, to see changes across PCA components.
* the number of PCA components at 30 and the number of windows at 100, to see changes across kNN values.
* additionally, the number of PCA components at 30 and the number of kNN at $\sqrt{N}$, to see changes across the number of windows for median correction.

The strategy in this case will be the same: take the distance values for each of the parameters and calculate the overlap between the first N features. For example, when comparing kNN values, we are going to compare the values from $\sqrt{N}$, seed 0, with those from $2\sqrt{N}$, seed 1, seed 2, etc. We will also compare $\sqrt{N}$ with itself, which has already been done, but which is still a useful reference.

We will also calculate the Pearson correlation between the distances of the first N features. Pearson correlation, in contrast to the overlap of features, is more robust but less reliable for our purpose, because what we are ultimately interested in is which features are selected.
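The two measures can be sketched as follows. This is an illustration with made-up distance values, not the code in `compare_parameter`; in particular, computing the correlation over the top-N genes of the first run is an assumption.

```python
import math


def top_n(scores, n):
    """Names of the n genes with the highest distance."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def top_n_overlap(scores_a, scores_b, n):
    """Fraction of genes shared by the top-n lists of two runs."""
    return len(set(top_n(scores_a, n)) & set(top_n(scores_b, n))) / n


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Hypothetical per-gene distances from two runs (e.g. two kNN values)
run_a = {'g1': 0.90, 'g2': 0.80, 'g3': 0.70, 'g4': 0.20, 'g5': 0.10}
run_b = {'g1': 0.85, 'g2': 0.75, 'g3': 0.30, 'g4': 0.72, 'g5': 0.12}

n = 3
overlap = top_n_overlap(run_a, run_b, n)
genes = top_n(run_a, n)  # assumption: correlate over the top-n genes of run_a
corr = pearson([run_a[g] for g in genes], [run_b[g] for g in genes])
print(f'overlap={overlap:.2f}, correlation={corr:.2f}')
```

Here `g3` drops out of the top 3 in the second run and `g4` enters, so the overlap is 2/3 even though most distances barely changed. Overlap reacts to changes in ranking near the cutoff, while correlation summarises the agreement of the values themselves.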

In [None]:
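# lib_prep is the library preparation method and dataset is the benchmark name,
# matching how both are used in the loops further down
lib_prep, org, dataset, save_dir = '10X', 'human', 'ding', os.getcwd() + '/exports/robustness/'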

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 500, what='overlap', by='knn')
df_violin_0_1000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 1000, what='overlap', by='knn')
df_violin_0_2500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 2500, what='overlap', by='knn')
df_violin_0_5000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 5000, what='overlap', by='knn')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter(
    [df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500],
    ['0 - 5000', '0 - 2500', '0 - 1000', '0 - 500'],
    lib_prep, org, dataset, step=1, palette='sunsetmid4', by='knn',
    title='kNN_robustness,_overlap', ylabel='Overlap', plot_random=False,
)

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 500, what='correlation', by='knn')
df_violin_0_1000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 1000, what='correlation', by='knn')
df_violin_0_2500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 2500, what='correlation', by='knn')
df_violin_0_5000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 5000, what='correlation', by='knn')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500], 
   ['0 - 5000',  '0 - 2500', '0 - 1000', '0 - 500'], 
    lib_prep, org, dataset, step=1, 
    palette = 'sunsetmid4', by='knn',
    title='kNN_robustness,_correlation', 
    ylabel='Pearson correlation', plot_random=False)

There is a general trend across datasets: overlap decreases quite rapidly across kNN values, so kNN is a sensitive parameter to choose. Generally, values between $\sqrt{N}/2$ and $\sqrt{N}$ have similar overlaps (70-80% in the worst-case scenario). Interestingly, the overlap with larger kNN values decreases more rapidly. This might be because many of the features of interest have low overall expression, or expression concentrated in few cells. A very high kNN value can then select more cells than the number of cells of interest for a given feature, so the kNN expression of the cells of interest is mixed with noise from cells that are not of interest.

For example, if a population in a 10000-cell dataset has 150 cells with a characteristic expression pattern, using $k = 5\sqrt{N} = 500$ cells will include zero or noisy counts from 350 cells. Therefore, the kNN expression will be noisier and the features will not be selected as well as with $k = \sqrt{N} = 100$ cells.

Therefore, if you expect a small subpopulation, it might even be better to set k to $0.5\sqrt{N}$, which is the default value.
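The arithmetic of this example can be checked with a short sketch. It assumes an idealised case in which the nearest neighbours of a cell are taken from its own population first; `neighbour_composition` is a hypothetical helper for illustration.

```python
import math


def neighbour_composition(k, population_size):
    """Idealised split of k neighbours into cells inside / outside the population.

    Assumes neighbours are drawn from the cell's own population until it is exhausted.
    """
    inside = min(k, population_size)
    return inside, k - inside


n_cells, population_size = 10_000, 150
for factor in (0.5, 1, 2, 5):
    k = round(factor * math.sqrt(n_cells))
    inside, outside = neighbour_composition(k, population_size)
    print(f'k = {k:>3}: {inside} inside, {outside} outside '
          f'({inside / k:.0%} of neighbours carry the signal)')
```

With $k = 500$, only 30% of the neighbours belong to the population of interest, whereas with $k = 50$ or $k = 100$ all of them do.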

Generally, for datasets with a higher number of detected genes the overlap is greater, around 85-90% between $\sqrt{N}/2$ and $\sqrt{N}$.

Although it depends on the dataset, there is a general trend: overlaps are smaller when fewer features are selected and greater when more features are selected. I do not know why this might happen. Possibly, there is a range of selected features in the 2000 to 5000 scale that will always be selected because the rest is expression noise, or much less defined genes, that is, genes with much less localized expression patterns.

We don't see a clear variation due to randomization.

Regarding correlation values, they are all much higher, above 0.9 or 0.95 in most cases. There is a more marked trend of higher correlation values up to $2\sqrt{N}$, with a sudden drop at $5\sqrt{N}$.

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 500, what='overlap', by='pca')
df_violin_0_1000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 1000, what='overlap', by='pca')
df_violin_0_2500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 2500, what='overlap', by='pca')
df_violin_0_5000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 5000, what='overlap', by='pca')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter(
    [df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500],
    ['0 - 5000', '0 - 2500', '0 - 1000', '0 - 500'],
    lib_prep, org, dataset, step=1, palette='sunsetmid4', by='pca',
    title='PCA_robustness,_overlap', ylabel='Overlap', plot_random=False,
)

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 500, what='correlation', by='pca')
df_violin_0_1000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 1000, what='correlation', by='pca')
df_violin_0_2500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 2500, what='correlation', by='pca')
df_violin_0_5000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 5000, what='correlation', by='pca')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500], 
   ['0 - 5000',  '0 - 2500', '0 - 1000', '0 - 500'], 
    lib_prep, org, dataset, step=1, 
    palette = 'sunsetmid4', by='pca', 
    title='PCA_robustness,_correlation', 
    ylabel='Pearson correlation', plot_random=False)

When we look at the overlap across numbers of PCA components, there is no clear trend. Generally, runs with fewer than 20 components tend to overlap poorly with the rest, whereas runs with between 20 and 50 components have similar overlap values.

Again, library preparation methods that yield a higher number of detected genes tend to score higher, which might be expected, because the *resolution* per gene will be better.

We don't see a clear variation due to randomization.

Correlation values are above 0.95 in all cases, regardless of dataset. The trends are the same as with overlap, but much more marked.

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 500, what='overlap', by='w')
df_violin_0_1000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 1000, what='overlap', by='w')
df_violin_0_2500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 2500, what='overlap', by='w')
df_violin_0_5000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 5000, what='overlap', by='w')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter(
    [df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500],
    ['0 - 5000', '0 - 2500', '0 - 1000', '0 - 500'],
    lib_prep, org, dataset, step=1, palette='sunsetmid4', by='w',
    title='window_robustness,_overlap', ylabel='Overlap', plot_random=False,
)

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 500, what='correlation', by='w')
df_violin_0_1000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 1000, what='correlation', by='w')
df_violin_0_2500 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 2500, what='correlation', by='w')
df_violin_0_5000 = compare_parameter(lib_prep, org, dataset, save_dir, 0, 5000, what='correlation', by='w')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500], 
   ['0 - 5000',  '0 - 2500', '0 - 1000', '0 - 500'], 
    lib_prep, org, dataset, step=1, palette = 'sunsetmid4', by='w',
    title='window_robustness,_correlation',
    ylabel="Pearson correlation", plot_random=False)

Robustness to the number of windows is really high: overlap values are higher than 90%, and correlation values are even higher, above 0.99. There is a *symmetrical* decrease across the number of windows.

# Iterate over all datasets!
Since there are many datasets to iterate over, the code used in this notebook is also available in `robustness_functions.py`, so that it can be run on all datasets.

In [None]:
listdir = os.listdir(save_dir)
for org in ['human', 'mouse']:
    for lib_prep in ['SingleNuclei', 'inDrop', '10X', 'SMARTseq2', 'CELseq2', 'QUARTZseq', 'Dropseq', 'sci-RNAseq', 'Seq-Well',]:
        for dataset in ['mereu', 'ding']:
            matchfiles = [i for i in listdir if org in i and lib_prep in i and dataset in i]
            if matchfiles:
                print(org, dataset, lib_prep)
                try:
                    get_all_pics_dataset(lib_prep, org, dataset, save_dir, plot_random=False)
                    plt.show()
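                except Exception as err:
                    # Report the failure instead of silently ignoring it, then continue
                    print(f'Skipping {org} {dataset} {lib_prep}: {err}')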

# Comparison of robustness across datasets
In this section we are going to compare overlap values across kNN, PCA components and windows, using different datasets. To do that we use three feature cutoffs (the first `low_val`, `mid_val` and `hi_val` features, set in the next cell): the middle cutoff is drawn as a line plot, and the range between the low and high cutoffs is drawn as a filled band. Each plotted point is the mean of the overlaps between the different seeds.
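A minimal sketch of how the plotted points could be derived, assuming one overlap value per seed comparison. The numbers are made up, the cutoffs mirror the ones set in the next cell, and `mean_per_value` is a hypothetical helper rather than part of `plot_scatter_datasets`.

```python
from statistics import mean

# Hypothetical overlaps, keyed by feature cutoff and then by kNN factor
# (multiples of sqrt(N)); each list holds one overlap per seed comparison.
overlaps = {
    500:  {0.5: [0.70, 0.74], 1: [0.96, 0.98], 2: [0.72, 0.76]},
    1500: {0.5: [0.78, 0.80], 1: [0.97, 0.99], 2: [0.79, 0.81]},
    2500: {0.5: [0.84, 0.86], 1: [0.98, 1.00], 2: [0.85, 0.87]},
}


def mean_per_value(per_value):
    """Average the per-seed overlaps for each parameter value."""
    return {value: mean(seed_overlaps) for value, seed_overlaps in per_value.items()}


low, mid, high = (mean_per_value(overlaps[cutoff]) for cutoff in (500, 1500, 2500))

for value in mid:
    print(f'{value} * sqrt(N): line = {mid[value]:.2f}, '
          f'band = [{low[value]:.2f}, {high[value]:.2f}]')
```

With matplotlib, the `mid` values would go to `Axes.plot`, and the `low` and `high` values to `Axes.fill_between`.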

In [None]:
what = 'overlap'
low_val, mid_val, hi_val = 500, 1500, 2500

In [None]:
for dataset in ['ding', 'mereu', ]:
    if dataset == 'mereu':
        lib_preps = ['SingleNuclei', 'inDrop', '10X', 'SMARTseq2', 'CELseq2', 'QUARTZseq',]
    else:
        # Ding may fail because some libraries do not exist for mouse. I don't care
        # because the most important results for PCA and kNN turn out right, and are 
        # the variables I care most for.
        lib_preps = ['SingleNuclei', 'inDrop', '10X', 'SMARTseq2', 'CELseq2', 'Seq-Well',]
    for org in ['human']:

        list_dicts_dfs = [{}, {}, {}]
        for lib_prep in lib_preps:
            list_dicts_dfs[0][lib_prep] = compare_parameter(lib_prep, org, dataset, save_dir, 0, low_val, what=what, by='knn')
            list_dicts_dfs[1][lib_prep] = compare_parameter(lib_prep, org, dataset, save_dir, 0, mid_val, what=what, by='knn')
            list_dicts_dfs[2][lib_prep] = compare_parameter(lib_prep, org, dataset, save_dir, 0, hi_val, what=what, by='knn')

        plot_scatter_datasets(list_dicts_dfs, org, by='knn', figsize=(7, 4),  palette='bold',
                                   title=f'General_kNN_robustness_{dataset}_{org}', ylabel=what, 
                              save_dir=os.getcwd() + '/figures/robustness_figs')


        list_dicts_dfs = [{}, {}, {}]
        for lib_prep in lib_preps:
            list_dicts_dfs[0][lib_prep] = compare_parameter(lib_prep, org, dataset, save_dir, 0, low_val, what=what, by='pca')
            list_dicts_dfs[1][lib_prep] = compare_parameter(lib_prep, org, dataset, save_dir, 0, mid_val, what=what, by='pca')
            list_dicts_dfs[2][lib_prep] = compare_parameter(lib_prep, org, dataset, save_dir, 0, hi_val, what=what, by='pca')

        plot_scatter_datasets(list_dicts_dfs, org, by='pca', figsize=(7, 4),  palette='bold',
                                   title=f'General_PCA_robustness_{dataset}_{org}', 
                              ylabel=what, save_dir=os.getcwd() + '/figures/robustness_figs')


        list_dicts_dfs = [{}, {}, {}]
        for lib_prep in lib_preps:
            list_dicts_dfs[0][lib_prep] = compare_parameter(lib_prep, org, dataset, save_dir, 0, low_val, what=what, by='w')
            list_dicts_dfs[1][lib_prep] = compare_parameter(lib_prep, org, dataset, save_dir, 0, mid_val, what=what, by='w')
            list_dicts_dfs[2][lib_prep] = compare_parameter(lib_prep, org, dataset, save_dir, 0, hi_val, what=what, by='w')

        plot_scatter_datasets(list_dicts_dfs, org, by='w', figsize=(7, 4),  palette='bold',
                                   title=f'General_window_robustness_{dataset}_{org}', 
                              ylabel=what, save_dir=os.getcwd() + '/figures/robustness_figs')