# Comparing other feature selection methods with triku
In this notebook we compare the performance of triku against other feature selection methods.

The methods that will be compared will be the following:
* Select genes with highest variance.   
* Scanpy's `sc.pp.highly_variable_genes`: It is based on Seurat's `vst` method, so they should return similar results.
* scry `devianceFeatureSelection()`: this is the feature selection method used in Irizarry's GLM-PCA paper (https://doi.org/10.1186/s13059-019-1861-6). According to its description, it computes a deviance statistic for each row feature of the count data, based on a multinomial null model that assumes each feature has a constant rate. Features with large deviance are likely to be informative, whereas uninformative, low-deviance features can be discarded to speed up downstream analyses and reduce the memory footprint. The `fam` parameter will be set to `binomial`, the default.
* M3Drop, which has two main functions:
    * M3Drop: the M3Drop model assumes that the proportion of zeros follows a Michaelis-Menten model, and a global Michaelis-Menten parameter $K$ is fitted. For each gene, its parameter $K_i$ is compared to $K$ using a $Z$-test, and the genes that differ significantly are selected.
    * NBumi: the procedure is similar to the one above, although the fitted model is now a negative binomial, and the selection of genes is again done using a $Z$-test.
* `BrenneckeGetVariableGenes` fits a function between CV$^2$ and mean expression. 

With the exception of scanpy and triku, the remaining methods are implemented in R. We will use Jupyter's `%%R` magic command and `anndata2ri` to convert `AnnData` into `SingleCellExperiment` objects, and we will write wrapper functions that accept that `AnnData` and return the list of selected features. These functions have to be defined in the notebook and cannot be moved to an external module.

M3Drop requires a normalization step, which will be done in-situ.
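
To make the deviance statistic described for scry more concrete, here is a minimal NumPy sketch of a per-gene binomial deviance. It is an illustration written under the assumption of a dense cells x genes count matrix; the function name `binomial_deviance` is hypothetical and this is not scry's actual implementation.

```python
import numpy as np


def binomial_deviance(counts):
    """Per-gene binomial deviance for a dense cells x genes count matrix (illustrative sketch)."""
    counts = np.asarray(counts, dtype=float)
    n_i = counts.sum(axis=1, keepdims=True)                # total counts per cell
    pi_j = counts.sum(axis=0, keepdims=True) / n_i.sum()   # constant rate per gene under the null
    expected = n_i * pi_j                                  # expected counts under the null
    rest = n_i - counts                                    # counts of the cell not in this gene
    with np.errstate(divide="ignore", invalid="ignore"):
        # Terms with zero counts contribute 0 (convention 0 * log 0 = 0)
        term_obs = np.where(counts > 0, counts * np.log(counts / expected), 0.0)
        term_rest = np.where(rest > 0, rest * np.log(rest / (n_i - expected)), 0.0)
    return 2.0 * (term_obs + term_rest).sum(axis=0)


# Toy example: gene 0 has the same rate in every cell, genes 1 and 2 do not
toy_counts = np.array([[5, 5, 0], [5, 0, 5], [5, 5, 0], [5, 0, 5]])
toy_deviance = binomial_deviance(toy_counts)
print(toy_deviance)
```

Genes would then be ranked from highest to lowest deviance, so in the toy example gene 0 would be ranked last.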

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
import triku as tk
import scanpy as sc
import pandas as pd
import numpy as np
import scipy.sparse as spr
import scipy.stats as sts
import os

from tqdm.notebook import tqdm

from bokeh.io import show, output_notebook, reset_output
from bokeh.plotting import figure
from bokeh.models import LinearColorMapper

import matplotlib.pyplot as plt
import matplotlib as mpl

from sklearn.metrics import adjusted_rand_score as ARS
from sklearn.metrics import adjusted_mutual_info_score as AMI

reset_output()
output_notebook()

In [None]:
import sys, os
sys.path.insert(0, os.getcwd() + '/code')

# Selection of palettes for cluster coloring, and scatter values
from comparing_feat_sel import clustering_binary_search
from palettes_and_cmaps import magma, bold_and_vivid

In [None]:
%matplotlib inline

In [None]:
import rpy2.rinterface_lib.callbacks, logging
from rpy2.robjects import pandas2ri
import anndata2ri

In [None]:
# Ignore R warning messages
#Note: this can be commented out to get more verbose R output
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)

# Automatically convert between AnnData/pandas objects and their R counterparts
anndata2ri.activate()
pandas2ri.activate()
# rpy2.ipython provides the %R / %%R magics (it supersedes the old `rmagic` extension)
%load_ext rpy2.ipython

In [None]:
%%R
# Load all the R libraries we will be using in the notebook
library(M3Drop) # Depends on r-foreign (conda-forge) and Hmisc and reldist (install.packages)
library(scry) # If R < 4, launch commit 9f0fc819

In [None]:
os.makedirs(os.getcwd() + '/exports/comparisons/', exist_ok=True)

In [None]:
%%R

run_scry <- function(sce){ #adata
    adata_ret = devianceFeatureSelection(sce, nkeep=dim(sce)[1])
    return(adata_ret) #returns adata with stats on .var
} 


run_brennecke <- function(sce){ #df
    res_df <- BrenneckeGetVariableGenes(sce, suppress.plot=TRUE, fdr=100)
    return(res_df) # returns sorted df with genes and stats
}


run_M3Drop <- function(sce){
    norm <- M3DropConvertData(sce, is.counts=TRUE)
    DE_genes <- M3DropFeatureSelection(norm, suppress.plot=TRUE, mt_threshold=50)
    return(DE_genes) # returns sorted df with genes and stats
    
}

run_NBumi <- function(sce){
    count_mat <- NBumiConvertData(sce, is.counts=TRUE)
    DANB_fit <- NBumiFitModel(count_mat)
    NBDropFS <- NBumiFeatureSelectionCombinedDrop(DANB_fit, suppress.plot=TRUE, qval.thresh=10)
    return(NBDropFS)  # returns sorted df with genes and stats
    
}


In [None]:
def run_scanpy(adata):
    adata = adata.copy()
    sc.pp.log1p(adata)
    ret = sc.pp.highly_variable_genes(adata, n_top_genes=adata.n_vars, inplace=False)
    df = pd.DataFrame(ret)
    df =  df.set_index(adata.var_names)
    return df # returns df with stats

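def run_variable(adata):
    # Per-gene standard deviation. For sparse matrices use Var[X] = E[X^2] - E[X]^2
    if spr.issparse(adata.X):
        var = adata.X.power(2).mean(0) - np.power(adata.X.mean(0), 2)
        std = np.sqrt(np.clip(np.asarray(var).flatten(), 0, None))
    else:
        std = adata.X.std(0)

    return std  # returns vector with the same order as var_names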

def run_triku(adata, seed):
    adata_copy = adata.copy()
    tk.tl.triku(adata_copy, n_comps=30, n_windows=100, random_state=seed, verbose='error')
    return adata_copy.var['emd_distance'] #pd series with distance

In [None]:
## TODO: SORT AND FILL IN!

In [None]:
def create_df_feature_ranking(adata, title_prefix):
    """
    Create a dataframe with the ranking of features, and another one with the feature values. The adata must be the raw
    adata. From it we will create the adata_df required by some R methods.
    The dataframes will include the following columns:
    - Triku with 10 seeds 'triku_SEEDN'
    - Scanpy's HVG 'scanpy'
    - Std 'std'
    - scry 'scry'
    - brennecke 'brennecke'
    - M3Drop 'm3drop'
    - NBumi 'nbumi'
    
    After each method is run, we fill the dataframe of values with the metric each method uses for feature selection,
    and the dataframe of ranks with the ranking based on the returned value (0, 1, 2, etc.).
    We create two separate dataframes because the values might be needed for other purposes. The rank dataframe is useful
    because the sorting direction of the values depends on the column (ascending for M3Drop and NBumi, descending for the rest).
    """
    
    adata = adata.copy()
#     sc.pp.subsample(adata, 0.05)
    sc.pp.filter_genes(adata, min_cells=1)
    sc.pp.filter_cells(adata, min_genes=1)
    if spr.issparse(adata.X):
        adata.X = np.asarray(adata.X.todense())
    
    adata_groups = [i.replace('Group', '') for i in adata.obs['Group']]
    adata.obs['groupn'] = adata_groups
    adata_df = pd.DataFrame(adata.X.T.astype(int), index=adata.var_names, columns=adata.obs_names)
        
    adata_short = sc.AnnData(X = adata.X) # we have to create a clean adata because some column break Rpush
    adata_short.var_names, adata_short.obs_names = adata.var_names, adata.obs_names
    %Rpush adata_df
    %Rpush adata_short
    
    
    index, columns = adata.var_names, [f'triku_{i}' for i in range(10)] + ['scanpy', 'std', 'scry', 'brennecke', 'm3drop', 'nbumi']
    df_values, df_ranks = pd.DataFrame(index=index, columns=columns), pd.DataFrame(index=index, columns=columns)
    
    for i in range(10):
        df_emd_distance = run_triku(adata, i)
        df_values.loc[df_emd_distance.index, f'triku_{i}'] = df_emd_distance.values
        
    
    scanpy_ret = run_scanpy(adata)
    df_values.loc[scanpy_ret.index, 'scanpy'] = scanpy_ret['dispersions_norm'].values
    
    std_ret = run_variable(adata)
    df_values.loc[:, 'std'] = std_ret
    
    scry_ret = %R run_scry(adata_short)
    df_values.loc[scry_ret.var.index, 'scry'] = scry_ret.var['binomial_deviance'].values
    
    brennecke_ret = %R run_brennecke(adata_df)
    df_values.loc[brennecke_ret.index, 'brennecke'] = brennecke_ret['effect.size'].values
    
    M3Drop_ret = %R run_M3Drop(adata_df)
    df_values.loc[M3Drop_ret.index, 'm3drop'] = M3Drop_ret['q.value'].values
    
    NBumi_ret = %R run_NBumi(adata_df)
    df_values.loc[NBumi_ret.index, 'nbumi'] = NBumi_ret['q.value'].values
    
    # Fill df_ranks using a double argsort. M3Drop and NBumi are NOT reversed ([::-1]) because they are q-values (lower is better)
    for col in [f'triku_{i}' for i in range(10)] + ['scanpy', 'std', 'scry', 'brennecke']:
        df_ranks[col] = df_values[col].values.argsort()[::-1].argsort()
    for col in ['m3drop', 'nbumi']:
        df_ranks[col] = df_values[col].values.argsort().argsort() # double argsort to return the rank!
    
    df_ranks.to_csv(os.getcwd() + '/exports/comparisons/' + title_prefix + '_feature_ranks.csv')
    df_values.to_csv(os.getcwd() + '/exports/comparisons/' + title_prefix + '_feature_values.csv')
    
    return df_values, df_ranks

In [None]:
for deprob in tqdm([0.005, 0.0065, 0.008, 0.01, 0.013, 0.016, 0.025, 0.05, 0.1, 0.3]):
    try:
        adata_deprob = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom', cache=True)
    except:
        adata_deprob = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom')
    create_df_feature_ranking(adata_deprob, f'scatter_{deprob}')

# ARI - Random datasets
Using ARI on random datasets is a way to assess the effectiveness of the feature selection. Random datasets were prepared with different probabilities of a gene being differentially expressed, so that we can compare the leiden clustering solution with the 9 populations. Triku can be run with different seeds, whereas the rest of the methods are deterministic. However, leiden clustering can be run with a seed in all cases. Therefore, we are going to run all processes with 10 seeds (although the deterministic feature selection methods will be run only once).

To compute the ARI we need to run leiden with as many clusters as Splatter populations. Since leiden is controlled by a resolution parameter, we need to adjust the resolution to match the number of clusters. To do that we are going to implement a binary search-like algorithm. We will start with resolutions 0.3 and 2. If either of those yields the desired number of clusters, we are done. Otherwise, we take the midpoint and run the clustering; if it yields the number of populations, we stop. If not, we replace the upper or lower resolution with the midpoint, so that the desired number of clusters stays between the two bounds. This algorithm will try at most 5 times (it gets to resolution differences of ~0.05, which is fair).
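
The search described above can be sketched as follows. This is a simplified illustration rather than the code of `clustering_binary_search`: `count_clusters` is a hypothetical callable that runs the clustering at a given resolution and returns the number of clusters (in practice it would wrap `sc.tl.leiden`), and the sketch assumes that the number of clusters does not decrease as the resolution increases.

```python
def resolution_binary_search(count_clusters, n_target, min_res=0.3, max_res=2.0, max_depth=5):
    """Search for a resolution that yields n_target clusters (illustrative sketch).

    Returns the last (resolution, n_clusters) pair that was tried.
    """
    low, high = min_res, max_res

    # First try both ends of the interval
    for res in (low, high):
        n_clusters = count_clusters(res)
        if n_clusters == n_target:
            return res, n_clusters

    # Then halve the interval at most max_depth times
    for _ in range(max_depth):
        res = (low + high) / 2
        n_clusters = count_clusters(res)
        if n_clusters == n_target:
            break
        if n_clusters < n_target:
            low = res   # too few clusters: the resolution must increase
        else:
            high = res  # too many clusters: the resolution must decrease

    return res, n_clusters


# Toy stand-in for the clustering: the number of clusters grows with the resolution
def fake_count_clusters(resolution):
    return int(resolution * 10)


found_res, found_n = resolution_binary_search(fake_count_clusters, n_target=9)
print(found_res, found_n)
```

If the target is not reached within `max_depth` midpoints, the sketch returns the last resolution it tried, so the caller should check the returned number of clusters.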

In [None]:
splatter_dir = os.path.dirname(os.getcwd()) + '/data/splatter/'

We are going to create a dictionary where each element is a dataframe of ARIs computed for one DE probability. Each dataframe has dimensions n_seeds x n_methods. Therefore, cell i,j will contain the ARI between the populations and the optimal leiden clustering for seed i and method j.

We are going to run the calculations with different numbers of features. To save time, we are going to use a high number of features (5000), store the selected features, and reuse the saved data in later calls.

In [None]:
dict_df_deprob, dict_features = {}, {}
n_seeds, n_feats = 10, 200
min_res, max_res, max_depth = 0.1, 2, 6
seed = 0

In [None]:
deprob = 0.01

adata = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom', cache=True)
sc.pp.filter_genes(adata, min_cells=1)
sc.pp.filter_cells(adata, min_genes=1)
sc.pp.subsample(adata, fraction=0.55)
adata.X = np.asarray(adata.X.todense())
adata_groups = [i.replace('Group', '') for i in adata.obs['Group']]
adata.obs['groupn'] = adata_groups
adata_df = pd.DataFrame(adata.X.T.astype(int), index=adata.var_names, columns=adata.obs_names)

In [None]:
n_feats = 100

In [None]:
adata_df = pd.DataFrame(adata.X.T.astype(int), index=adata.var_names, columns=adata.obs_names)
%Rpush adata_df n_feats

In [None]:
NBumi_feats = %R run_NBumi(adata_df, n_feats)
scanpy_feats = run_scanpy(adata, n_feats)
triku_feats, adata_triku = run_triku(adata, n_feats, 0)

In [None]:
for f in [triku_feats, scanpy_feats, NBumi_feats]:
    adata_plot = adata.copy()
    sc.pp.log1p(adata_plot)
    adata_plot.var['highly_variable'] = [i in f for i in adata_plot.var_names]
    sc.pp.pca(adata_plot, use_highly_variable=True)
    sc.pp.neighbors(adata_plot, n_neighbors=int(0.5 * len(adata_plot) ** 0.5), metric='cosine')

    
    c_f, res = clustering_binary_search(adata, min_res, max_res, max_depth, seed, len(list(dict.fromkeys(adata_groups))), f)
    print(f'{res} || ARI :',  ARS(c_f, adata_groups))
    
    adata_plot.obs['leiden'], adata_plot.obs['groups'] = c_f, adata_groups
    sc.tl.umap(adata_plot)
    sc.pl.umap(adata_plot, color=['leiden', 'groups'], legend_loc='on data')

In [None]:
for f in [triku_feats, scanpy_feats, NBumi_feats]:
    adata_plot = adata.copy()
    adata_plot.var['highly_variable'] = [i in f for i in adata_plot.var_names]
    sc.pp.pca(adata_plot, use_highly_variable=True)
    sc.pp.neighbors(adata_plot, n_neighbors=int(0.5 * len(adata_plot) ** 0.5), metric='cosine')

    
    c_f, res = clustering_binary_search(adata, min_res, max_res, max_depth, seed, len(list(dict.fromkeys(adata_groups))), f, apply_log=False)
    print(f'{res} || ARI :',  ARS(c_f, adata_groups))
    
    adata_plot.obs['leiden'], adata_plot.obs['groups'] = c_f, adata_groups
    sc.tl.umap(adata_plot)
    sc.pl.umap(adata_plot, color=['leiden', 'groups'], legend_loc='on data')

In [None]:
adata_scanpy = adata.copy()
adata_scanpy.var['highly_variable'] = [i in scanpy_feats for i in adata_plot.var_names]
sc.pp.pca(adata_scanpy, use_highly_variable=True)
sc.pp.neighbors(adata_scanpy, n_neighbors=int(0.5 * len(adata_plot) ** 0.5), metric='cosine')
sc.tl.umap(adata_scanpy)

adata_triku = adata.copy()
adata_triku.var['highly_variable'] = [i in triku_feats for i in adata_plot.var_names]
tk.tl.triku(adata_triku, n_features=n_feats, n_windows=100, verbose='error')
sc.pp.pca(adata_triku, use_highly_variable=True)
sc.pp.neighbors(adata_triku, n_neighbors=int(0.5 * len(adata_plot) ** 0.5), metric='cosine')
sc.tl.umap(adata_triku)

In [None]:
common_genes = list(set(triku_feats) & set(scanpy_feats))
triku_only =list(set(triku_feats) - set(scanpy_feats))
scanpy_only = list(set(scanpy_feats) - set(triku_feats))
print(len(common_genes), len(triku_only), len(scanpy_only))

In [None]:
sc.pl.umap(adata_scanpy, color=['groupn'] + scanpy_only, legend_loc='on data', cmap=magma)

In [None]:
sc.pl.umap(adata_scanpy, color=['groupn'] + triku_only, legend_loc='on data', cmap=magma)

In [None]:
color = []
labels = []
alpha = [0.8 if i in triku_feats or i in scanpy_feats else 0.4 for i in adata.var_names]

for i in adata.var_names:
    if i in triku_feats and i in scanpy_feats:
        color.append("#000000")
        labels.append('Both')
    elif i in triku_feats:
        color.append("#900020")
        labels.append('Triku')
    elif i in scanpy_feats:
        color.append("#007ab7")
        labels.append('Scanpy')
    else:
        color.append("#bcbcbc")
        labels.append('None')


df_bokeh = pd.DataFrame({
    'm': np.log10(adata_triku.X.mean(0)),
    'z': (adata_triku.X == 0).sum(0) / adata_triku.shape[0],
    'n': adata_triku.var_names.values,
    'd': adata_triku.var["emd_distance"],
    'color': color, 'label':labels, 'alpha':alpha
    })[:]

In [None]:
p = figure(tools="box_zoom,hover,reset", plot_height=400, plot_width=400, tooltips=[("Gene","@n"), ('Value', '@d')])

p.scatter('m', 'd', source=df_bokeh,
          alpha=0.7, line_color=None,
         color='color', legend_group='label')

p.legend.location = 'top_left'
show(p)

In [None]:
def get_info_gene(adata, gene):
    mean_exp_val = []
    
    groups = sorted(list(set(adata.obs['groupn'].values)))
    dict_argwhere = {group: np.argwhere(adata.obs['groupn'].values == group) for group in groups}
    exp_gene = adata_triku[:, gene].X.ravel()
    
    for g in groups:
        exp_group = np.sort(exp_gene[dict_argwhere[g]].ravel())
        mean_exp_val.append(np.mean(exp_group[: int(0.90 * len(exp_group))])) # for genes with small expression it may amplify noise
        
    
    mean_exp_val_1 = np.array(mean_exp_val)/sum(mean_exp_val)
    
#     info = sts.entropy(mean_exp_val_1) / np.log(len(groups))
    info = max(np.sort(mean_exp_val_1)[3] - np.sort(mean_exp_val_1)[0], np.sort(mean_exp_val_1)[-1] - np.sort(mean_exp_val_1)[-4])
    
    return info, mean_exp_val_1
    


In [None]:
gene = 'Gene13823'

ent, arr = get_info_gene(adata_triku, gene)
print(ent)
print(list(arr))

In [None]:
sc.pl.umap(adata_triku, color=['groupn', gene], legend_loc='on data', cmap=magma)
sc.pl.umap(adata_scanpy, color=['groupn', gene], legend_loc='on data', cmap=magma)

In [None]:
dict_df_deprob, dict_features = {}, {}
n_seeds, n_feats = 10, 5000
min_res, max_res, max_depth = 0.1, 2, 6

for deprob in [0.005, 0.0065, 0.008, 0.01, 0.013, 0.016, 0.025, 0.05, 0.1, 0.3]:
    adata = sc.read_loom(splatter_dir + f'/splatter_deprob_{deprob}.loom')
    adata.X = np.asarray(adata.X.todense())
    adata_groups = [i.replace('Group', '') for i in adata.obs['Group']]
    adata_df = pd.DataFrame(adata.X.T, index=adata.var_names, columns=adata.obs_names)  # adata.X is already dense
    
    %Rpush adata_df n_feats
    
    df = pd.DataFrame(index = np.arange(n_seeds), columns=['triku', 'var', 'scanpy', 'scry', 'brennecke', 'M3Drop', 'NBUmi'])
    
    for seed in np.arange(n_seeds):
        if seed == 0:
            triku_feats = run_triku(adata, n_feats, seed)
            var_feats = run_variable(adata, n_feats)
            scanpy_feats = run_scanpy(adata, n_feats)
            NBumi_feats = %R run_NBumi(adata_df, n_feats)
            scry_feats = %R run_scry(adata_df, n_feats)
            brennecke_feats = %R run_brennecke(adata_df, n_feats)
            M3Drop_feats = %R run_M3Drop(adata_df, n_feats)
            
            dict_features[f'triku_{seed}'], dict_features['var'], dict_features['scanpy'] = triku_feats, var_feats, scanpy_feats
            dict_features['scry'], dict_features['brennecke'], dict_features['M3Drop'], dict_features['NBUmi'] = scry_feats, brennecke_feats, M3Drop_feats, NBumi_feats
        
        else:
            triku_feats = run_triku(adata, n_feats, seed)
            dict_features[f'triku_{seed}'] = triku_feats
            
        # Run the clustering with the features of each method and store the ARI
        # against the simulated groups (clustering_binary_search returns clusters and resolution)
        n_groups = len(set(adata_groups))
        feats_by_method = {
            'triku': triku_feats, 'var': var_feats, 'scanpy': scanpy_feats, 'scry': scry_feats,
            'brennecke': brennecke_feats, 'M3Drop': M3Drop_feats, 'NBUmi': NBumi_feats,
        }
        for method, feats in feats_by_method.items():
            clusters, _ = clustering_binary_search(adata, min_res, max_res, max_depth, seed, n_groups, feats)
            df.loc[seed, method] = ARS(clusters, adata_groups)

    dict_df_deprob[deprob] = df

In [None]:
%%R

aaa <- function(a, b, c){
    return(a+b+c)
}

In [None]:
a = 9
b = 19
c = 99

In [None]:
%%R -i a -i b -i c -o d

d = aaa(a,b,c)

In [None]:
d

In [None]:
%Rpush a b c
d = %R aaa(a,b,c)

In [None]:
d

In [None]:
%%R -i df -i n -o a

a = run_scry(df, n)

In [None]:
n = 1000

In [None]:
%Rpush df n
a = %Rget run_scry(df, n)

In [None]:
a

In [None]:
adata = sc.read_loom(splatter_dir + f'/splatter_deprob_{0.01}.loom')
adata.obs['groups'] = [i.replace('Group', '') for i in adata.obs['Group']]

In [None]:
adata.obs['Group']

In [None]:
adata = sc.datasets.pbmc3k()
sc.pp.filter_genes(adata, min_cells=10)
df = pd.DataFrame(np.asarray(adata.X.T.todense()), index=adata.var_names, columns=adata.obs_names)

In [None]:
%%R -i df -i df -o run_M3Drop

run_M3Drop <- run_NBumi(df, 1500)


In [None]:
%%R -i df -i adata -o scry_ret -o brennecke_ret -o M3Drop_ret -o NBumi_ret

scry_ret <- run_scry(adata, 1500)
brennecke_ret <- run_brennecke(adata, 1500)
M3Drop_ret <- run_M3Drop(df, 1500)
NBumi_ret <- run_NBumi(df, 1500) 

In [None]:
run_M3Drop

In [None]:
import sys, os
sys.path.insert(0, os.getcwd() + '/code')

# Selection of palettes for cluster coloring, and scatter values
from palettes_and_cmaps import magma, bold_and_vivid
from robustness_functions import run_batch, random_noise_parameter, plot_scatter_parameter, compare_parameter, plot_scatter_datasets


In [None]:
save_dir = os.getcwd() + '/exports/'

In [None]:
data_dir = os.path.dirname(os.getcwd()) + '/data/'

In [None]:
import os

In [None]:
save_dir = os.getcwd() + '/exports/'
read_dir = data_dir + 'Ding_2020/'

In [None]:
# To be readable, the files must be named EXACTLY as follows:
# matrix.mtx.gz (you MUST rename it)
# features.tsv (you should rename it)
# barcodes.tsv (you should rename it)
from scipy.io import mmread

matrix = mmread('/media/seth/SETH_DATA/SETH_Alex/triku/data/Ding_2020/preprocessed/mouse/matrix.mtx.gz')
features = np.loadtxt('/media/seth/SETH_DATA/SETH_Alex/triku/data/Ding_2020/preprocessed/mouse/features.tsv', dtype=str)
barcodes = np.loadtxt('/media/seth/SETH_DATA/SETH_Alex/triku/data/Ding_2020/preprocessed/mouse/barcodes.tsv', dtype=str)

In [None]:
adata = sc.AnnData(X=matrix.tocsr()).transpose()
adata.var_names = features
adata.obs_names = barcodes

In [None]:
meta = pd.read_csv('/media/seth/SETH_DATA/SETH_Alex/triku/data/Ding_2020/preprocessed/mouse/meta_combined.txt', sep='\t', skiprows=[1])
adata = adata[meta['NAME'].values]
adata.obs['method'] = meta['Method'].values
adata.obs['CellType'] = meta['CellType'].values

In [None]:
adata

In [None]:
a = 1_000
a-1

In [None]:
meta

In [None]:
adata

In [None]:
matrix

In [None]:
adata = sc.read_text(read_dir + '/GSE133545_SMARTseq2_human_exp_mat.tsv').transpose()
adata.var_names_make_unique()
sc.pp.filter_genes(adata, min_cells=30)

In [None]:
tk.tl.triku(adata, verbose='triku', n_procs=4, knn=50)

In [None]:
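from scipy.signal import fftconvolve

def func_a(a, b):
    # Direct convolution
    return np.convolve(a, b)

def func_b(a, b):
    # FFT-based convolution; clip the small negative values caused by numerical error
    x = fftconvolve(a, b)
    x[x < 0] = 0
    return x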

def apply_convolution_read_counts(probs: np.ndarray, knn: int, func) -> (np.ndarray, np.ndarray):
    """
    Convolution of functions. The function convolves a probability distribution with itself knn times
    using `func`. The result is an array of N elements (N arises from convolving an n-length array
    knn times) where element i is the probability of observing i reads.

    Parameters
    ----------
    probs : np.ndarray
        Probability distribution of read counts for one gene: element i is the probability of a cell having i reads.
    knn : int
        Number of nearest neighbours (kNN), i.e. the number of distributions that are convolved.
    func : callable
        Function that convolves two 1D arrays (e.g. `func_a` or `func_b`).
    """
      
    
    # We are calculating the convolution of cells with positive expression. Thus, in the first distribution
    # we have to remove the cells with 0 reads, and rescale the probabilities.
    arr_0 = probs.copy()
    arr_0[0] = 0  # TODO: this will fail in log-transformed data
    arr_0 /= arr_0.sum()

    # We will use arr_base as the array with the read distribution
    arr_base = probs.copy()
    
    arr_convolve = func(arr_0, arr_base, )
    
    for knni in range(2, knn):
        arr_convolve = func(arr_convolve, arr_base, )

    # TODO: check the probability sum is 1 and, if so, remove
    arr_prob = arr_convolve / arr_convolve.sum()

    # TODO: if log transformed, this is untrue. Should not be arange.
    return np.arange(len(arr_prob)), arr_prob

In [None]:
counts_gene = adata.X[:, 377]
from tqdm.notebook import tqdm
import time

In [None]:
times_a, sums_a, times_b, sums_b = [], [], [], []

In [None]:
for i in tqdm(range(500, 1000)):
    counts_gene = adata.X[:, i]
    y_probs = np.bincount(counts_gene.astype(int)) / len(counts_gene)
    t = time.time()
    apply_convolution_read_counts(y_probs, 50, func_a)
    times_a.append(time.time() - t)
    sums_a.append(counts_gene.sum())
    
    
    t = time.time()
    apply_convolution_read_counts(y_probs, 50, func_b)
    times_b.append(time.time() - t)
    sums_b.append(counts_gene.sum())

In [None]:
fig = plt.figure(figsize=(15,8))
plt.scatter(np.log10(sums_a), np.log10(times_a))
plt.scatter(np.log10(sums_b), np.log10(times_b))

In [None]:
y_probs, len(y_probs)

In [None]:
a = apply_convolution_read_counts(y_probs, 8, func_a)

In [None]:
b = apply_convolution_read_counts(y_probs, 8, func_b)

In [None]:
np.sum(((a[1]-b[1])**2)**0.5)

In [None]:
a, 

## How to run this notebook
This notebook contains several cells, each explaining its own purpose. The notebook is prepared to run with a single dataset from the Holger dataset. I force the use of this dataset simply because the robustness is easily checked in all cases, and there is no point in generalising the functions to other datasets, at least in this case.

**SET THE VARIABLES BELOW**

In [None]:
lib_prep, org = 'QUARTZseq', 'human'

# Robustness based on seed
Many parts of triku require a random seed: the PCA calculation (because we use the randomized variant), the nearest neighbor calculation, and the random matrix generation. Each of those processes affects the distances we obtain, either the ones from the original dataset or the ones from the randomized dataset. We therefore have to evaluate this noise and see how it might affect the selected genes, even without considering the randomization.

To do that we will study which of the variables affects, or enhances, the noise: kNN calculation, PCA, or dataset randomization (the number of windows is not stochastic, so it won't be a factor). To see the effect, we will make two types of plots: variability evaluation, and limit of features achievable given the level of noise.

For the first type of plot we will fix a variable and evaluate a range of the other variables (fix kNN and evaluate number of components, and vice-versa). If, for example, we fix kNN, for each of the possible number of components, we will take pairs of dataframes (combinations of two seeds), and compare their corrected and uncorrected distance together. 

The comparison is done as follows: if $d_A$ is the distance of a gene in the dataframe with seed A, and $d_B$ is the distance of the same gene with seed B, then the comparison value is:

$$\frac{|d_A - d_B|}{|d_A| + |d_B|}$$

This value ranges between 0 ($d_A = d_B$) and 1 ($d_A$ and $d_B$ have opposite signs).
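
As a small numerical check of this comparison value, the following sketch computes it element-wise for two arrays of distances. The function name `relative_noise` is illustrative and not necessarily how `random_noise_parameter` computes it; defining the value as 0 when both distances are 0 is an assumption of the sketch.

```python
import numpy as np


def relative_noise(d_a, d_b):
    """Element-wise |d_A - d_B| / (|d_A| + |d_B|) for two arrays of per-gene distances."""
    d_a = np.asarray(d_a, dtype=float)
    d_b = np.asarray(d_b, dtype=float)
    numerator = np.abs(d_a - d_b)
    denominator = np.abs(d_a) + np.abs(d_b)
    # Assumption: when both distances are 0 the noise is defined as 0
    return np.divide(numerator, denominator, out=np.zeros_like(numerator), where=denominator > 0)


# Equal values, a moderate difference, opposite signs, and two zeros
noise = relative_noise([1.0, 2.0, -1.0, 0.0], [1.0, 1.0, 1.0, 0.0])
print(noise)
```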

We plot a swarmplot with three categories: the first 250 features (ranked by highest distance), features 250 to 1000, and features 1000 to 5000. We should expect less noise in the first features, because they are the ones with the largest distances. Also, for each option in the swarmplot, the column on the **left shows distances without randomization** and the column on the **right shows distances with randomization**. We should expect higher noise in the distances with randomization, because the random noise from the randomized dataset is higher than that of the non-randomized one.

In [None]:
df_0_250 = random_noise_parameter(lib_prep, org, save_dir, 0, 250, what='relative noise', by='knn')
df_250_1000 = random_noise_parameter(lib_prep, org, save_dir, 250, 1000, what='relative noise', by='knn')
df_1000_5000 = random_noise_parameter(lib_prep, org, save_dir, 1000, 5000, what='relative noise', by='knn')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter(
    [df_1000_5000, df_250_1000, df_0_250], 
    ['1000 - 5000', '250 - 1000', '0 - 250'], 
    lib_prep, org, by='knn', 
    title='Noise_distance_from_seed,_kNN', 
    ylabel="$\\frac{|d_A - d_B|}{|d_A| + |d_B|}$")

In [None]:
df_0_250 = random_noise_parameter(lib_prep, org, save_dir, 0, 250, what='relative noise', by='pca')
df_250_1000 = random_noise_parameter(lib_prep, org, save_dir, 250, 1000, what='relative noise', by='pca')
df_1000_5000 = random_noise_parameter(lib_prep, org, save_dir, 1000, 5000, what='relative noise', by='pca')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter(
    [df_1000_5000, df_250_1000, df_0_250], 
    ['1000 - 5000', '250 - 1000', '0 - 250'], 
    lib_prep, org, by='pca',
    title='Noise_distance_from_seed,_PCA',
    ylabel="$\\frac{|d_A - d_B|}{|d_A| + |d_B|}$")

For kNN we see that there is generally a sweet spot between $\sqrt{N}/2$ and $2\sqrt{N}$, with smaller variation in the distances without randomization. As for PCA components, we see that, interestingly, seed noise increases with the number of PCA components. Although this seems paradoxical, it makes sense: the first components are less prone to be variable, and thus they are less affected by noise. In fact, if we consider 3 components, the noise is almost 0 in the non-randomized set of distances, which tells us that the noise in the randomized set of distances arises mainly from dataset randomization. However, a small noise does not mean that the selected features are the correct ones! Fewer components mean less information from the dataset. We'll see that later on, when we compare features across numbers of components.

In the next plots we will directly consider the overlap between features. For each set of features within a range, we will compute the percentage of features shared between seeds. For example, for kNN = $\sqrt{N}$ and 100 PCA components, we compare the first 250 features of seed 0 with those of seed 1. Instead of considering features 250 - 1000 and 1000 - 5000, we will consider features 0 - 1000 and 0 - 5000 directly.
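
The overlap between two runs can be sketched as below. This is a simplified stand-in for what `random_noise_parameter(..., what='overlap')` reports, not its actual code: the function name `top_feature_overlap` is hypothetical, and both arrays are assumed to hold the distances of the same genes in the same order.

```python
import numpy as np


def top_feature_overlap(dist_a, dist_b, n_top):
    """Fraction of the n_top highest-distance features shared by two runs (illustrative sketch)."""
    top_a = set(np.argsort(dist_a)[::-1][:n_top].tolist())  # indices of the n_top largest distances
    top_b = set(np.argsort(dist_b)[::-1][:n_top].tolist())
    return len(top_a & top_b) / n_top


# Toy distances for 5 genes obtained with two different seeds
dist_seed_0 = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
dist_seed_1 = np.array([5.0, 3.0, 4.0, 1.0, 2.0])

overlap_top2 = top_feature_overlap(dist_seed_0, dist_seed_1, n_top=2)
overlap_top3 = top_feature_overlap(dist_seed_0, dist_seed_1, n_top=3)
print(overlap_top2, overlap_top3)
```

In the toy example the two seeds disagree on the second-ranked gene, so the overlap of the first 2 features is lower than the overlap of the first 3.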

In [None]:
df_0_250 = random_noise_parameter(lib_prep, org, save_dir, 0, 250, what='overlap', by='knn')
df_0_1000 = random_noise_parameter(lib_prep, org, save_dir, 0, 1000, what='overlap', by='knn')
df_0_5000 = random_noise_parameter(lib_prep, org, save_dir, 0, 5000, what='overlap', by='knn')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_0_5000, df_0_1000, df_0_250], 
   ['0 - 5000', '0 - 1000', '0 - 250'],
    lib_prep, org, step=1, by='knn',
    palette = 'sunsetmid3', 
    title='Overlap_of_features_from_seed,_kNN',
    ylabel='Overlap')

In [None]:
df_0_250 = random_noise_parameter(lib_prep, org, save_dir, 0, 250, what='overlap', by='pca')
df_0_1000 = random_noise_parameter(lib_prep, org, save_dir, 0, 1000, what='overlap', by='pca')
df_0_5000 = random_noise_parameter(lib_prep, org, save_dir, 0, 5000, what='overlap', by='pca')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_0_5000, df_0_1000, df_0_250], 
            ['0 - 5000', '0 - 1000', '0 - 250'], lib_prep, org, 
            step=1, palette = 'sunsetmid3', 
            by='pca', title='Overlap_of_features_from_seed,_PCA', 
                      ylabel='Overlap')

We see a trend similar to the previous case: low kNN values can be detrimental to the quality of feature selection, although the effect is resolved with a number of kNN between $\sqrt{N}/2$ and $2\sqrt{N}$. Regarding PCA components, there is not much variation, with overlap values higher than 95% at sensible numbers of PCA components, near 20-30.

# Robustness between different parameter values

In this section we are going to compare the percentage of overlapping features given different parameter values. As in the previous section, we are going to fix the number of kNN at $\sqrt{N}$ and the number of PCA components at 30. Additionally, we are going to fix both parameters to see the effect of changing the number of windows for median correction.

The strategy in this case will be the same: consider the distance values for each of the parameters, and calculate the overlap between the first N features. For example, when comparing kNN values, we are going to compare the values from $\sqrt{N}$, seed 0, with $2\sqrt{N}$, seed 1, seed 2, etc. We will also compare $\sqrt{N}$ with itself, which has already been done, but which will still be useful.

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, save_dir, 0, 500, what='overlap', by='knn')
df_violin_0_1000 = compare_parameter(lib_prep, org, save_dir, 0, 1000, what='overlap', by='knn')
df_violin_0_2500 = compare_parameter(lib_prep, org, save_dir, 0, 2500, what='overlap', by='knn')
df_violin_0_5000 = compare_parameter(lib_prep, org, save_dir, 0, 5000, what='overlap', by='knn')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500], 
   ['0 - 5000',  '0 - 2500', '0 - 1000', '0 - 500'], 
    lib_prep, org, step=1, 
    palette = 'sunsetmid4', by='knn',
    title='kNN_robustness,_overlap', 
    ylabel='Overlap')

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, save_dir, 0, 500, what='overlap', by='pca')
df_violin_0_1000 = compare_parameter(lib_prep, org, save_dir, 0, 1000, what='overlap', by='pca')
df_violin_0_2500 = compare_parameter(lib_prep, org, save_dir, 0, 2500, what='overlap', by='pca')
df_violin_0_5000 = compare_parameter(lib_prep, org, save_dir, 0, 5000, what='overlap', by='pca')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500], 
   ['0 - 5000',  '0 - 2500', '0 - 1000', '0 - 500'], 
    lib_prep, org, step=1, 
    palette = 'sunsetmid4', by='pca', 
    title='PCA_robustness,_overlap', 
    ylabel='Overlap')

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, save_dir, 0, 500, what='overlap', by='w')
df_violin_0_1000 = compare_parameter(lib_prep, org, save_dir, 0, 1000, what='overlap', by='w')
df_violin_0_2500 = compare_parameter(lib_prep, org, save_dir, 0, 2500, what='overlap', by='w')
df_violin_0_5000 = compare_parameter(lib_prep, org, save_dir, 0, 5000, what='overlap', by='w')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500], 
   ['0 - 5000',  '0 - 2500', '0 - 1000', '0 - 500'], 
    lib_prep, org, step=1, palette = 'sunsetmid4', by='w', 
    title='window_robustness,_overlap', 
    ylabel='Overlap')

# TODO: WRITE UP COMMENTS FROM THESE RESULTS AND DO WINDOW!

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, save_dir, 0, 500, what='correlation', by='knn')
df_violin_0_1000 = compare_parameter(lib_prep, org, save_dir, 0, 1000, what='correlation', by='knn')
df_violin_0_2500 = compare_parameter(lib_prep, org, save_dir, 0, 2500, what='correlation', by='knn')
df_violin_0_5000 = compare_parameter(lib_prep, org, save_dir, 0, 5000, what='correlation', by='knn')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500], 
   ['0 - 5000',  '0 - 2500', '0 - 1000', '0 - 500'], 
    lib_prep, org, step=1, 
    palette = 'sunsetmid4', by='knn',
    title='kNN_robustness,_correlation', 
    ylabel='Pearson correlation')

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, save_dir, 0, 500, what='correlation', by='pca')
df_violin_0_1000 = compare_parameter(lib_prep, org, save_dir, 0, 1000, what='correlation', by='pca')
df_violin_0_2500 = compare_parameter(lib_prep, org, save_dir, 0, 2500, what='correlation', by='pca')
df_violin_0_5000 = compare_parameter(lib_prep, org, save_dir, 0, 5000, what='correlation', by='pca')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500], 
   ['0 - 5000',  '0 - 2500', '0 - 1000', '0 - 500'], 
    lib_prep, org, step=1, 
    palette = 'sunsetmid4', by='pca', 
    title='PCA_robustness,_correlation', 
    ylabel='Pearson correlation')

In [None]:
df_violin_0_500 = compare_parameter(lib_prep, org, save_dir, 0, 500, what='correlation', by='w')
df_violin_0_1000 = compare_parameter(lib_prep, org, save_dir, 0, 1000, what='correlation', by='w')
df_violin_0_2500 = compare_parameter(lib_prep, org, save_dir, 0, 2500, what='correlation', by='w')
df_violin_0_5000 = compare_parameter(lib_prep, org, save_dir, 0, 5000, what='correlation', by='w')

# Remember: left = non randomized - right = randomized
plot_scatter_parameter([df_violin_0_5000, df_violin_0_2500, df_violin_0_1000, df_violin_0_500], 
   ['0 - 5000',  '0 - 2500', '0 - 1000', '0 - 500'], 
    lib_prep, org, step=1, palette = 'sunsetmid4', by='w',
    title='window_robustness,_correlation',
    ylabel="Pearson correlation")

# TODO: WRITE UP COMMENTS FROM THESE RESULTS AND DO WINDOW!

In [None]:
org, lib_preps = 'human', ['SingleNuclei', 'inDrop', '10XV3', 'SMARTseq2', 'CELseq2', 'QUARTZseq',]

In [None]:
list_dicts_dfs = [{}, {}, {}]
by, what = 'knn', 'overlap'
low_val, mid_val, hi_val = 500, 1500, 2500

for lib_prep in lib_preps:
    list_dicts_dfs[0][lib_prep] = compare_parameter(lib_prep, org, save_dir, 0, low_val, what=what, by=by)
    list_dicts_dfs[1][lib_prep] = compare_parameter(lib_prep, org, save_dir, 0, mid_val, what=what, by=by)
    list_dicts_dfs[2][lib_prep] = compare_parameter(lib_prep, org, save_dir, 0, hi_val, what=what, by=by)

In [None]:
plot_scatter_datasets(list_dicts_dfs, org, by, figsize=(7, 4),  palette='prism',
                           title='', ylabel='', save_dir='robustness_figs')

In [None]:
data_dir = os.path.dirname(os.getcwd()) + '/data/'
save_dir = os.getcwd() + '/exports/'
read_dir = data_dir + 'Holger_CNAG_2019/'


for file in os.listdir(read_dir):
    if org in file and 'exp_mat' in file and lib_prep in file:
        file_in = file
        
        
adata = sc.read_text(read_dir + file_in).transpose()
adata.var_names_make_unique()
sc.pp.filter_genes(adata, min_cells=10)

In [None]:
adata_1 = adata.copy()
tk.tl.triku(adata_1, n_comps=30, n_windows=70, 
            knn=54, random_state=1)

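# Distance before median correction, relative to the randomized baseline
dist_uncorrected = adata_1.var['emd_distance_uncorrected'] - adata_1.var['emd_distance_random']
n_features = np.sum(adata_1.var['highly_variable'].values)
# Score of the n_features-th best gene; >= keeps the top n_features (barring ties)
cutoff = np.sort(dist_uncorrected)[-n_features]
highly_variable_uncorrected = (dist_uncorrected >= cutoff).values
highly_variable_corrected = adata_1.var['highly_variable'].values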
color = []
labels = []

for i in range(len(highly_variable_uncorrected)):
    if highly_variable_corrected[i] and highly_variable_uncorrected[i]:
        color.append("#000000")
        labels.append('Both')
    elif highly_variable_corrected[i] and not highly_variable_uncorrected[i]:
        color.append("#900020")
        labels.append('Corrected')
    elif not highly_variable_corrected[i] and highly_variable_uncorrected[i]:
        color.append("#007ab7")
        labels.append('Uncorrected')
    else:
        color.append("#bcbcbc")
        labels.append('None')


df_bokeh_1 = pd.DataFrame({
    'm': np.log10(adata_1.X.mean(0)),
    'z': (adata_1.X == 0).sum(0) / adata_1.shape[0],
    'n': adata_1.var_names.values,
    'd': adata_1.var["emd_distance_uncorrected"],
    'e': adata_1.var["emd_distance_uncorrected"] - adata_1.var["emd_distance_random"],
    'e_correct': adata_1.var["emd_distance"],
    'color': color, 'label':labels
    })[:]

In [None]:
adata_2 = adata.copy()
tk.tl.triku(adata_2, n_comps=30, 
            n_windows=100, knn=15, 
            random_state=1)

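# Distance before median correction, relative to the randomized baseline
dist_uncorrected = adata_2.var['emd_distance_uncorrected'] - adata_2.var['emd_distance_random']
n_features = np.sum(adata_2.var['highly_variable'].values)
# Score of the n_features-th best gene; >= keeps the top n_features (barring ties)
cutoff = np.sort(dist_uncorrected)[-n_features]
highly_variable_uncorrected = (dist_uncorrected >= cutoff).values
highly_variable_corrected = adata_2.var['highly_variable'].values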
color = []
labels = []

for i in range(len(highly_variable_uncorrected)):
    if highly_variable_corrected[i] and highly_variable_uncorrected[i]:
        color.append("#000000")
        labels.append('Both')
    elif highly_variable_corrected[i] and not highly_variable_uncorrected[i]:
        color.append("#900020")
        labels.append('Corrected')
    elif not highly_variable_corrected[i] and highly_variable_uncorrected[i]:
        color.append("#007ab7")
        labels.append('Uncorrected')
    else:
        color.append("#bcbcbc")
        labels.append('None')


df_bokeh_2 = pd.DataFrame({
    'm': np.log10(adata_2.X.mean(0)),
    'z': (adata_2.X == 0).sum(0) / adata_2.shape[0],
    'n': adata_2.var_names.values,
    'd': adata_2.var["emd_distance_uncorrected"],
    'e': adata_2.var["emd_distance_uncorrected"] - adata_2.var["emd_distance_random"],
    'e_correct': adata_2.var["emd_distance"],
    'color': color, 'label':labels
    })[:]

In [None]:
p = figure(tools="box_zoom,hover,reset", plot_height=400, plot_width=400, tooltips=[("Gene","@n"), 
                                                                                    ('Value', '@e'), 
                                                                                    ('Unc', '@d')])

p.scatter('m', 'e_correct', source=df_bokeh_1,
          alpha=0.7, line_color=None,
         color='color', legend_group='label')

p.legend.location = 'top_right'
show(p)

In [None]:
p = figure(tools="box_zoom,hover,reset", plot_height=400, plot_width=400, tooltips=[("Gene","@n"), 
                                                                                    ('Value', '@e'), 
                                                                                    ('Unc', '@d')])

p.scatter('m', 'e_correct', source=df_bokeh_2,
          alpha=0.7, line_color=None,
         color='color', legend_group='label')

p.legend.location = 'top_right'
show(p)

In [None]:
dist_seed_1 = adata_1.var['emd_distance']
dist_seed_2 = adata_2.var['emd_distance']


In [None]:
N0, Nf = 0, 1000
len(np.intersect1d(dist_seed_1.sort_values(ascending=False).index[N0:Nf].values,
                dist_seed_2.sort_values(ascending=False).index[N0:Nf].values))/(Nf - N0)

In [None]:
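# Start at 15 rather than 0 to avoid a 0/0 overlap for an empty feature set
x = np.arange(15, 2000, 15)
ranked_1 = dist_seed_1.sort_values(ascending=False).index
ranked_2 = dist_seed_2.sort_values(ascending=False).index
y = [len(np.intersect1d(ranked_1[N0:Nf].values, ranked_2[N0:Nf].values)) / (Nf - N0) for Nf in x]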

In [None]:
plt.scatter(x, y)