# Comparing other feature selection methods with triku
In this notebook we compare the performance of triku with that of other feature selection methods. 

The methods that will be compared will be the following:
* Select genes with highest variance.   
* Scanpy's `sc.pp.highly_variable_genes`: It is based on Seurat's `vst` method, so they should return similar results.
* scry `devianceFeatureSelection()`. This method is the feature selection used in Irizarry's GLM-PCA paper (https://doi.org/10.1186/s13059-019-1861-6). From its description, it computes a deviance statistic for each row feature of count data, based on a multinomial null model that assumes each feature has a constant rate. Features with large deviance are likely to be informative. Uninformative, low-deviance features can be discarded to speed up downstream analyses and reduce the memory footprint. The `fam` parameter will be set to `binomial`, the default.
* M3Drop, which has two main functions:
    * M3Drop: the M3Drop model assumes that the proportion of zeros follows a Michaelis-Menten model. The Michaelis-Menten parameter $K$ is then fitted. For each gene, its parameter $K_i$ is compared to $K$ using a $Z$-test, which yields the selected genes.
    * NBumi: the procedure is similar to the above, although the model fitted is now a negative binomial, and the selection of genes is again done using a $Z$-test.
* `BrenneckeGetVariableGenes` fits a function between CV$^2$ and mean expression. 
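
To make the deviance idea in the scry item more concrete, here is a minimal NumPy sketch of a per-gene binomial deviance under a constant-rate null. It is an illustration only and not scry's implementation: the function name `binomial_deviance_sketch` and the toy matrix are assumptions made for this example.

```python
import numpy as np

def binomial_deviance_sketch(counts):
    """Per-gene binomial deviance for a cells x genes matrix of raw counts (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1, keepdims=True)      # total counts per cell
    pi = counts.sum(axis=0) / n.sum()          # constant rate of each gene under the null
    rest = n - counts                          # counts of the cell that belong to other genes
    with np.errstate(divide='ignore', invalid='ignore'):
        # Terms with a zero count contribute 0 (0 * log(0) is taken as 0)
        term_obs = np.where(counts > 0, counts * np.log(counts / (n * pi)), 0.0)
        term_rest = np.where(rest > 0, rest * np.log(rest / (n * (1 - pi))), 0.0)
    return 2 * (term_obs + term_rest).sum(axis=0)

# Toy data (made up): gene 0 has the same rate in every cell, gene 1 changes strongly between cells
toy_counts = np.array([[10, 0, 90],
                       [10, 40, 50],
                       [10, 80, 10]])
toy_deviance = binomial_deviance_sketch(toy_counts)
```

In this toy matrix the constant-rate gene gets a deviance of essentially zero, while the gene whose rate changes across cells gets a large value, which is why high-deviance genes are kept.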

With the exception of scanpy and triku, the remaining methods are implemented in R. We will use Jupyter's `%%R` magic command and `anndata2ri` to convert `AnnData` objects into `SingleCellExperiment` objects, and we will write wrapper functions that accept that AnnData and return the list of selected features. These functions have to be defined in the notebook and cannot be moved to an external module. 

M3Drop requires a normalization step, which will be done in-situ.

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
import triku as tk
import scanpy as sc
import pandas as pd
import numpy as np
import scipy.sparse as spr
import scipy.stats as sts
import os

from tqdm.notebook import tqdm

from bokeh.io import show, output_notebook, reset_output
from bokeh.plotting import figure
from bokeh.models import LinearColorMapper

import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.lines import Line2D

from sklearn.metrics import adjusted_rand_score as ARS
from sklearn.metrics import adjusted_mutual_info_score as AMI

reset_output()
output_notebook()

In [None]:
import sys, os
sys.path.insert(0, os.getcwd() + '/code')

# Selection of palettes for cluster coloring, and scatter values
from comparing_feat_sel import clustering_binary_search, plot_max_var_x_dataset, plot_max_var_x_method, get_max_diff_gene
from palettes_and_cmaps import magma, bold_and_vivid, prism

In [None]:
%matplotlib inline

In [None]:
import rpy2.rinterface_lib.callbacks, logging
from rpy2.robjects import pandas2ri
import anndata2ri

In [None]:
# Ignore R warning messages
#Note: this can be commented out to get more verbose R output
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)

# Automatically convert rpy2 outputs to pandas dataframes
anndata2ri.activate()
pandas2ri.activate()
# rpy2.ipython provides the %R and %%R magics (it supersedes the old `rmagic` extension)
%load_ext rpy2.ipython

In [None]:
%%R
# Load all the R libraries we will be using in the notebook
library(M3Drop) # Depends on r-foreign (conda-forge) and Hmisc and reldist (install.packages)
library(scry) # If R < 4, launch commit 9f0fc819

In [None]:
os.makedirs(os.getcwd() + '/exports/comparisons/', exist_ok=True)

In the following 3 cells we define the functions that obtain the most relevant features. Since some of the calls are to R, they have to be kept in separate cells. We also create the function `create_df_feature_ranking`, which creates two dataframes: one with the evaluation values (p-value, EMD distance, etc.) of each method, and a second one with the ranking of genes based on those values. Storing these dataframes means we don't have to rerun the feature selection methods each time we draw a graph. `create_df_feature_ranking` is also kept in a notebook cell because it makes some calls to R.
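
The conversion from evaluation values to ranks relies on a double `argsort`. The sketch below shows the idea on made-up toy values (the arrays are illustrative assumptions, not outputs of any method): scores where larger is better are reversed before the second `argsort`, while q-values, where smaller is better, are not.

```python
import numpy as np

# Toy values for four genes (made up for illustration)
scores = np.array([0.2, 1.5, 0.7, 0.1])        # larger is better, e.g. a distance or dispersion
q_values = np.array([0.04, 0.5, 0.001, 0.9])   # smaller is better

# The first argsort gives the order of the genes; the second turns that order into a rank per gene
ranks_desc = scores.argsort()[::-1].argsort()  # rank 0 = gene with the largest score
ranks_asc = q_values.argsort().argsort()       # rank 0 = gene with the smallest q-value

print(ranks_desc)  # [2 0 1 3]
print(ranks_asc)   # [1 2 0 3]
```

With ranks stored this way, selecting the top `n` features of any method reduces to the comparison `ranks < n`, regardless of the direction of the original metric.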

In [None]:
%%R

run_scry <- function(sce){ #adata
    adata_ret = devianceFeatureSelection(sce, nkeep=dim(sce)[1])
    return(adata_ret) #returns adata with stats on .var
} 


run_brennecke <- function(sce){ #df
    res_df <- BrenneckeGetVariableGenes(sce, suppress.plot=TRUE, fdr=100)
    return(res_df) # returns sorted df with genes and stats
}


run_M3Drop <- function(sce){
    norm <- M3DropConvertData(sce, is.counts=TRUE)
    DE_genes <- M3DropFeatureSelection(norm, suppress.plot=TRUE, mt_threshold=50)
    return(DE_genes) # returns sorted df with genes and stats
    
}

run_NBumi <- function(sce){
    count_mat <- NBumiConvertData(sce, is.counts=TRUE)
    DANB_fit <- NBumiFitModel(count_mat)
    NBDropFS <- NBumiFeatureSelectionCombinedDrop(DANB_fit, suppress.plot=TRUE, qval.thresh=10)
    return(NBDropFS)  # returns sorted df with genes and stats
    
}


In [None]:
def run_scanpy(adata):
    adata = adata.copy()
    sc.pp.log1p(adata)
    ret = sc.pp.highly_variable_genes(adata, n_top_genes=len(adata), inplace=False)
    df = pd.DataFrame(ret)
    df =  df.set_index(adata.var_names)
    return df # returns df with stats

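def run_variable(adata):
    if spr.issparse(adata.X):
        # Var = E[X^2] - E[X]^2; clip tiny negative values caused by rounding before taking the square root
        var = np.asarray(adata.X.power(2).mean(0) - np.power(adata.X.mean(0), 2)).flatten()
        std = np.sqrt(np.clip(var, 0, None))
    else:
        std = adata.X.std(0)
        
    return std # returns vector of per-gene std, in the same order as var_names 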

def run_triku(adata, seed):
    adata_copy = adata.copy()
    tk.tl.triku(adata_copy, n_comps=30, n_windows=100, random_state=seed, verbose='error')
    return adata_copy.var['emd_distance'] #pd series with distance

In [None]:
def create_df_feature_ranking(adata, title_prefix):
    """
    Create a dataframe with the ranking of features, and another one with the feature values. The adata must be the raw
    adata. From it we create the adata_df required by some R methods.
    The dataframes will include the columns:
    - Triku with 10 seeds 'triku_SEEDN'
    - Scanpy's HVG 'scanpy'
    - Std 'std'
    - scry 'scry'
    - brennecke 'brennecke'
    - M3Drop 'm3drop'
    - NBumi 'nbumi'
    
    After each method is run, we fill the dataframe of values with the metric each method uses for feature selection, 
    and the dataframe of ranks with the ranking derived from that value (0, 1, 2, etc.). 
    We create two separate dataframes because the df with values might be reused for other purposes. The rank dataframe is useful
    because the columns of the values dataframe have different sort orders (ascending for M3Drop and NBumi, descending for the rest).
    """
    
    adata = adata.copy()
#     sc.pp.subsample(adata, 0.05)
    sc.pp.filter_genes(adata, min_cells=1)
    sc.pp.filter_cells(adata, min_genes=1)
    if spr.issparse(adata.X):
        adata.X = np.asarray(adata.X.todense())
    
    adata_groups = [i.replace('Group', '') for i in adata.obs['Group']]
    adata.obs['groupn'] = adata_groups
    adata_df = pd.DataFrame(adata.X.T.astype(int), index=adata.var_names, columns=adata.obs_names)
        
    adata_short = sc.AnnData(X = adata.X) # we have to create a clean adata because some column break Rpush
    adata_short.var_names, adata_short.obs_names = adata.var_names, adata.obs_names
    %Rpush adata_df
    %Rpush adata_short
    
    
    index, columns = adata.var_names, [f'triku_{i}' for i in range(10)] + ['scanpy', 'std', 'scry', 'brennecke', 'm3drop', 'nbumi']
    df_values, df_ranks = pd.DataFrame(index=index, columns=columns), pd.DataFrame(index=index, columns=columns)
    
    for i in range(10):
        df_emd_distance = run_triku(adata, i)
        df_values.loc[df_emd_distance.index, f'triku_{i}'] = df_emd_distance.values
        
    
    scanpy_ret = run_scanpy(adata)
    df_values.loc[scanpy_ret.index, 'scanpy'] = scanpy_ret['dispersions_norm'].values
    
    std_ret = run_variable(adata)
    df_values.loc[:, 'std'] = std_ret
    
    scry_ret = %R run_scry(adata_short)
    df_values.loc[scry_ret.var.index, 'scry'] = scry_ret.var['binomial_deviance'].values
    
    brennecke_ret = %R run_brennecke(adata_df)
    df_values.loc[brennecke_ret.index, 'brennecke'] = brennecke_ret['effect.size'].values
    
    M3Drop_ret = %R run_M3Drop(adata_df)
    df_values.loc[M3Drop_ret.index, 'm3drop'] = M3Drop_ret['q.value'].values
    
    NBumi_ret = %R run_NBumi(adata_df)
    df_values.loc[NBumi_ret.index, 'nbumi'] = NBumi_ret['q.value'].values
    
    # Fill df_ranks with a double argsort. M3Drop and NBumi are NOT reversed ([::-1]) because they are q-values (lower is better)
    for col in [f'triku_{i}' for i in range(10)] + ['scanpy', 'std', 'scry', 'brennecke']:
        df_ranks[col] = df_values[col].values.argsort()[::-1].argsort()
    for col in ['m3drop', 'nbumi']:
        df_ranks[col] = df_values[col].values.argsort().argsort() # double argsort to return the rank!
    
    df_ranks.to_csv(os.getcwd() + '/exports/comparisons/' + title_prefix + '_feature_ranks.csv')
    df_values.to_csv(os.getcwd() + '/exports/comparisons/' + title_prefix + '_feature_values.csv')
    
    return df_values, df_ranks

# Random datasets
For this section we will use the random datasets generated with splatter.
To evaluate the performance of the feature selection methods, we will use two metrics, maximum difference and ARI, explained below.

In [None]:
splatter_dir = os.path.dirname(os.getcwd()) + '/data/splatter/'

**THIS PROCESS TAKES ~ 5 HOURS!**

In [None]:
list_deprobs = [0.005, 0.0065, 0.008, 0.01, 0.013, 0.016, 0.025, 0.05, 0.1, 0.3]

for deprob in tqdm(list_deprobs):
    try:
        adata_deprob = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom', cache=True)
    except:
        adata_deprob = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom')
    create_df_feature_ranking(adata_deprob, f'scatter_{deprob}')

## Maximum difference (or maximum variation)
We define maximum variation as the maximum value of the differences between groups. The maximum variation is calculated in the following steps:
* For each gene, check whether any group in the dataset has more than X% of expressing cells (25% by default). If all groups have fewer than X% of expressing cells, the variation is set to 0. 
* If the gene is selected, calculate the trimmed mean (2.5% lowest and highest values removed by default) for each group.
* Scale the array of group means so that it sums to 1.
* Sort the values and select $\max(|a[0] - a[X]|, |a[-X] - a[-1]|)$ (X = 3 by default). We select the first and last values because generally either one or two clusters are overexpressed, or one cluster is underexpressed, and the corresponding values will appear at the beginning or end of the sorted mean expression array.
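
The steps above can be sketched for a single gene as follows. This is an illustrative reimplementation written for this explanation, not the code of `get_max_diff_gene`: the names `trimmed_mean` and `max_variation_sketch`, their parameters and the toy data are assumptions, and the last step follows the formula literally as written above.

```python
import numpy as np

def trimmed_mean(values, trim):
    """Mean after removing a fraction `trim` of the lowest and of the highest values."""
    values = np.sort(values)
    k = int(trim * len(values))
    return values[k:len(values) - k].mean()

def max_variation_sketch(expr, groups, min_frac=0.25, trim=0.025, x=3):
    """expr: expression of one gene per cell; groups: group label per cell (needs more than x groups)."""
    expr, groups = np.asarray(expr, dtype=float), np.asarray(groups)
    labels = np.unique(groups)
    # Step 1: the gene is kept only if some group has more than min_frac of expressing cells
    frac_expressing = np.array([(expr[groups == g] > 0).mean() for g in labels])
    if not (frac_expressing > min_frac).any():
        return 0.0
    # Step 2: trimmed mean per group
    means = np.array([trimmed_mean(expr[groups == g], trim) for g in labels])
    if means.sum() == 0:
        return 0.0
    # Step 3: scale so that the group means sum to 1, then sort
    a = np.sort(means / means.sum())
    # Step 4: largest gap at either end of the sorted array
    return float(max(abs(a[0] - a[x]), abs(a[-x] - a[-1])))

# Toy data (made up): 9 groups of 40 cells each
toy_groups = np.repeat(np.arange(9), 40)
marker_gene = np.where(toy_groups == 0, 50.0, 5.0)  # overexpressed in one group
flat_gene = np.full(toy_groups.shape, 5.0)          # same expression everywhere
silent_gene = np.zeros(toy_groups.shape)            # never expressed

mv_marker = max_variation_sketch(marker_gene, toy_groups)
mv_flat = max_variation_sketch(flat_gene, toy_groups)
mv_silent = max_variation_sketch(silent_gene, toy_groups)
```

In this toy example the gene overexpressed in a single group scores 0.5, while the flat and the non-expressed genes score 0, which matches the intent of the metric: high values for genes that separate one or two groups from the rest.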

In [None]:
# Calculate the largest difference in all dataset

In [None]:
for deprob in tqdm(list_deprobs):
    try:
        adata_deprob = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom', cache=True)
    except:
        adata_deprob = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom')
    
    adata_deprob.X = np.asarray(adata_deprob.X.todense())
    
    groups = sorted(list(dict.fromkeys(adata_deprob.obs['Group'].values)))
    df_max_var = pd.DataFrame(index=adata_deprob.var_names, columns=groups + ['maximum_variation'])
    
    for gene in tqdm(adata_deprob.var_names):
        max_var, arr_info = get_max_diff_gene(adata_deprob, gene, 'Group')
        df_max_var.loc[gene, groups] = arr_info
        df_max_var.loc[gene, 'maximum_variation'] = max_var
        
    df_max_var.to_csv(os.getcwd() + '/exports/comparisons/' + f'scatter_{deprob}' + '_maximum_variation.csv')

We will plot two modalities:
* The X axis represents the different feature selection methods, and each plot represents a dataset. We will select datasets with "conflictive" probabilities, like 0.01 or 0.016. For each feature selection method, features are grouped by rank (0-50, 50-100, 100-200, etc.) to show the stratification based on rank.
* The X axis represents the different datasets and, for each dataset, the different feature selection methods are plotted. Each plot corresponds to a given number of features.
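
As a rough illustration of the rank stratification used in the first modality, the sketch below groups genes into consecutive rank bins. The internals of `plot_max_var_x_method` are not shown in this notebook, so the helper `genes_in_rank_bins` and its half-open bins `[lo, hi)` are assumptions made for this example.

```python
import numpy as np

def genes_in_rank_bins(ranks, gene_names, edges):
    """Return {(lo, hi): genes whose rank r satisfies lo <= r < hi} for consecutive edges."""
    ranks, gene_names = np.asarray(ranks), np.asarray(gene_names)
    return {
        (lo, hi): gene_names[(ranks >= lo) & (ranks < hi)]
        for lo, hi in zip(edges[:-1], edges[1:])
    }

# Toy ranking of six genes (made up); in the notebook the edges would be e.g. [0, 50, 100, 200, 500, 1000]
toy_ranks = np.array([3, 0, 2, 1, 5, 4])
toy_names = np.array(['g0', 'g1', 'g2', 'g3', 'g4', 'g5'])
toy_bins = genes_in_rank_bins(toy_ranks, toy_names, edges=[0, 2, 4, 6])
```

Each bin can then be used to look up the maximum variation of its genes, so that one distribution per rank range is obtained for every method.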

In [None]:
deprob = 0.016

df_max_var = pd.read_csv(os.getcwd() + '/exports/comparisons/' + f'scatter_{deprob}' + '_maximum_variation.csv', index_col=0)
df_feature_ranks = pd.read_csv(os.getcwd() + '/exports/comparisons/' + f'scatter_{deprob}' + '_feature_ranks.csv', index_col=0)
df_feature_ranks = df_feature_ranks[['triku_0'] + [i for i in df_feature_ranks.columns if 'triku' not in i]].rename(columns = {'triku_0': 'triku'})

plot_max_var_x_method(df_feature_ranks, df_max_var, feature_list = [0, 50, 100, 200, 500, 1000], title=f'Maximum difference by number of features, random w/ DE {deprob}')

We see that, at least for triku, differences are more apparent for the first features (50 - 100), and for larger numbers of features the difference distribution diminishes. For features in the range 0 to 100 or 200, triku shows the highest values, followed by scanpy and scry. Interestingly, brennecke fails to rank any informative feature among its first 200 features, which is unexpected.

In [None]:
dict_df_feature_ranks, dict_df_max_var_dataset = {}, {}

for deprob in [0.005, 0.0065, 0.008, 0.01, 0.013, 0.016, 0.025, 0.05, 0.1, 0.3]:
    df_max_var = pd.read_csv(os.getcwd() + '/exports/comparisons/' + f'scatter_{deprob}' + '_maximum_variation.csv', index_col=0)
    df_feature_ranks = pd.read_csv(os.getcwd() + '/exports/comparisons/' + f'scatter_{deprob}' + '_feature_ranks.csv', index_col=0)
    df_feature_ranks = df_feature_ranks[['triku_0'] + [i for i in df_feature_ranks.columns if 'triku' not in i]].rename(columns = {'triku_0': 'triku'})
    
    dict_df_feature_ranks[deprob], dict_df_max_var_dataset[deprob] = df_feature_ranks, df_max_var
    

plot_max_var_x_dataset(dict_df_feature_ranks, dict_df_max_var_dataset, n_features=100, title='Maximum difference in datasets, 100 features')
plot_max_var_x_dataset(dict_df_feature_ranks, dict_df_max_var_dataset, n_features=200, title='Maximum difference in datasets, 200 features')
plot_max_var_x_dataset(dict_df_feature_ranks, dict_df_max_var_dataset, n_features=500, title='Maximum difference in datasets, 500 features')
plot_max_var_x_dataset(dict_df_feature_ranks, dict_df_max_var_dataset, n_features=2500, title='Maximum difference in datasets, 2500 features')

We see that, generally, triku shows the greatest difference across datasets, regardless of the number of features selected. For higher numbers of features the difference becomes less apparent at lower DE ranges (up to 0.025), simply because in those datasets there are not that many distinguishing features and, therefore, if more features are selected, they only contribute as *background noise*. 

To see why triku does perform better, we are going to choose a moderately difficult dataset (DE = 0.016), plot UMAPs for each method and its first features, and see any patterns that are missing.

In [None]:
deprob = 0.016
try:
    adata_deprob = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom', cache=True)
except:
    adata_deprob = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom')
    
adata_deprob.X = np.asarray(adata_deprob.X.todense())
    
tk.tl.triku(adata_deprob)
sc.pp.log1p(adata_deprob)
sc.pp.pca(adata_deprob, use_highly_variable=True)
sc.pp.neighbors(adata_deprob, n_neighbors=int(0.5 * len(adata_deprob) ** 0.5), metric='cosine')
sc.tl.umap(adata_deprob)

In [None]:
n_rows = 60
fig, axs = plt.subplots(n_rows, 7, figsize=(7 * 3, n_rows * 2.5))

rank_list = list(np.arange(15)) + list(np.linspace(15, 150, n_rows-15).astype(int))

for col in range(7):   
    for row in range(n_rows):
        method = df_feature_ranks.columns[col]
        gene = df_feature_ranks[method][df_feature_ranks[method] == rank_list[row]].index[0]
        max_var = df_max_var.loc[gene, 'maximum_variation']
        
        sc.pl.umap(adata_deprob, color=gene, cmap=magma, ax=axs[row][col], title='', show=False)
        
        axs[row][col].set_xlabel(f"{gene} ({rank_list[row]}), {max_var:.2f}")
        axs[row][col].set_ylabel('')
        axs[row][col].xaxis.set_label_position('top') 
        axs[row][col].axes.get_xaxis().set_ticks([])
        axs[row][col].axes.get_yaxis().set_ticks([])
    
    
for col in range(7):
    axs[0][col].set_title(df_feature_ranks.columns[col])
    
    
plt.tight_layout()

There is no single clear reason why triku selects genes with larger differences. Generally, triku selects genes that appear underexpressed in one cluster, and it does not favour as much the selection of genes that are expressed equally among clusters but unequally within each cluster. By selecting genes overexpressed or underexpressed in one cluster its scores are better, although this does not mean that other methods do not also select such genes. However, judging by the difference scores, the rest of the methods seem to prefer genes that are overexpressed within each cluster rather than overexpressed in one cluster. Again, this is a trend: all methods select genes with overexpression in one cluster (at least the more apparent ones).

Brennecke and M3Drop completely fail here because the distributions fitted by these methods do not match those of the gene count matrix. This does not mean that they will perform as badly in real datasets, but it shows how methods that depend on distribution fitting can be unstable in other types of datasets. We will evaluate this in the biological benchmarking datasets, to see if they underperform with less common library preparation methods.

## ARI - Random datasets
Using ARI on random datasets is a way to assess the effectiveness of the feature selection. Random datasets were prepared with different probabilities of differentially expressed genes, so that we can compare the leiden clustering solution with the 9 populations. Triku can be run with different seeds, but the rest of the methods are deterministic. However, leiden clustering can be run with a seed in all cases. Therefore, we are going to run all processes with 10 seeds (although the deterministic feature selection methods will be run once).
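
For reference, the ARI can be computed from the contingency table of the two labelings. The notebook itself uses scikit-learn's `adjusted_rand_score`; the pure-Python function below (`adjusted_rand_index_sketch`, a name chosen for this example) is only a sketch of the standard formula, shown on made-up labels and assuming at least two cells.

```python
from collections import Counter
from math import comb

def adjusted_rand_index_sketch(labels_true, labels_pred):
    """Adjusted Rand index from pair counts (assumes at least two samples)."""
    n = len(labels_true)
    # Pairs of cells that fall together in both labelings, in the truth, and in the prediction
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # value expected for random labelings
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:                     # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_both - expected) / (max_index - expected)

# Toy labels (made up)
truth = [0, 0, 1, 1, 2, 2]
renamed = ['b', 'b', 'c', 'c', 'a', 'a']   # same partition, different cluster names
scrambled = [0, 1, 0, 1, 0, 1]             # splits every true group

ari_renamed = adjusted_rand_index_sketch(truth, renamed)
ari_scrambled = adjusted_rand_index_sketch(truth, scrambled)
```

The index does not depend on how clusters are named, so the renamed partition scores 1, while a clustering that splits every true group scores below 0, i.e. worse than expected by chance.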

To apply the ARI we need to run leiden with as many clusters as there are populations in the dataset. Since leiden is parameterized by resolution, we need to adjust the resolution parameter to match the number of clusters. To do that we are going to implement a binary search-like algorithm. We will start with resolutions 0.3 and 2. If either of those yields the desired number of clusters, we are done. Otherwise, we take the midpoint and run the clustering; if it yields the number of populations, we stop. Otherwise, we replace the upper or lower resolution with the midpoint, so that the desired number of clusters stays between the two bounds. This algorithm will try at most 5 times (it gets to resolution differences of ~0.05, which is fair).
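
The search just described can be sketched as below. This is not the code of `clustering_binary_search`: the function `resolution_binary_search` and the callable `cluster_fn` (which would wrap a leiden run at a given resolution) are hypothetical names for this example, and the sketch assumes that the number of clusters does not decrease as the resolution grows.

```python
import numpy as np

def resolution_binary_search(cluster_fn, n_target, min_res=0.3, max_res=2.0, max_depth=5):
    """cluster_fn(resolution) -> array of cluster labels (hypothetical callable)."""
    # First try both ends of the interval
    for res in (min_res, max_res):
        labels = cluster_fn(res)
        if len(np.unique(labels)) == n_target:
            return labels, res
    # Then halve the interval up to max_depth times
    for _ in range(max_depth):
        mid = (min_res + max_res) / 2
        labels = cluster_fn(mid)
        n_found = len(np.unique(labels))
        if n_found == n_target:
            break
        if n_found < n_target:
            min_res = mid   # too few clusters: search higher resolutions
        else:
            max_res = mid   # too many clusters: search lower resolutions
    return labels, mid      # best effort if the target was not reached

# Toy stand-in for a clustering call (made up): more clusters at higher resolution
def toy_cluster_fn(resolution):
    return np.arange(100) % max(1, int(resolution * 6))

toy_labels, toy_res = resolution_binary_search(toy_cluster_fn, n_target=9)
```

If the target number of clusters is never hit, the sketch returns the last clustering tried, so the caller should still check the number of clusters obtained.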

We are going to create a dictionary where each element is a dataframe of ARIs computed for one DE probability. Each dataframe has dimensions n_seeds x n_methods. Therefore, cell i,j will hold the ARI between the populations and the optimal leiden clustering for seed i and method j.

We are going to run the calculations with different numbers of features. To save time, we run the feature selection once with a high number of features (5000), store the selected features, and reuse the saved data in later calls.

In [None]:
dict_df_deprob, dict_features = {}, {}
n_seeds, n_feats = 10, 200
min_res, max_res, max_depth = 0.1, 2, 6
seed = 0

In [None]:
deprob = 0.01

adata = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom', cache=True)
sc.pp.filter_genes(adata, min_cells=1)
sc.pp.filter_cells(adata, min_genes=1)
sc.pp.subsample(adata, fraction=0.55)
adata.X = np.asarray(adata.X.todense())
adata_groups = [i.replace('Group', '') for i in adata.obs['Group']]
adata.obs['groupn'] = adata_groups
adata_df = pd.DataFrame(adata.X.T.astype(int), index=adata.var_names, columns=adata.obs_names)

In [None]:
n_feats = 100

In [None]:
adata_df = pd.DataFrame(adata.X.T.astype(int), index=adata.var_names, columns=adata.obs_names)
%Rpush adata_df n_feats

In [None]:
NBumi_feats = %R run_NBumi(adata_df, n_feats)
scanpy_feats = run_scanpy(adata, n_feats)
triku_feats, adata_triku = run_triku(adata, n_feats, 0)

In [None]:
for f in [triku_feats, scanpy_feats, NBumi_feats]:
    adata_plot = adata.copy()
    sc.pp.log1p(adata_plot)
    adata_plot.var['highly_variable'] = [i in f for i in adata_plot.var_names]
    sc.pp.pca(adata_plot, use_highly_variable=True)
    sc.pp.neighbors(adata_plot, n_neighbors=int(0.5 * len(adata_plot) ** 0.5), metric='cosine')

    
    c_f, res = clustering_binary_search(adata, min_res, max_res, max_depth, seed, len(list(dict.fromkeys(adata_groups))), f)
    print(f'{res} || ARI :',  ARS(c_f, adata_groups))
    
    adata_plot.obs['leiden'], adata_plot.obs['groups'] = c_f, adata_groups
    sc.tl.umap(adata_plot)
    sc.pl.umap(adata_plot, color=['leiden', 'groups'], legend_loc='on data')

In [None]:
for f in [triku_feats, scanpy_feats, NBumi_feats]:
    adata_plot = adata.copy()
    adata_plot.var['highly_variable'] = [i in f for i in adata_plot.var_names]
    sc.pp.pca(adata_plot, use_highly_variable=True)
    sc.pp.neighbors(adata_plot, n_neighbors=int(0.5 * len(adata_plot) ** 0.5), metric='cosine')

    
    c_f, res = clustering_binary_search(adata, min_res, max_res, max_depth, seed, len(list(dict.fromkeys(adata_groups))), f, apply_log=False)
    print(f'{res} || ARI :',  ARS(c_f, adata_groups))
    
    adata_plot.obs['leiden'], adata_plot.obs['groups'] = c_f, adata_groups
    sc.tl.umap(adata_plot)
    sc.pl.umap(adata_plot, color=['leiden', 'groups'], legend_loc='on data')

In [None]:
adata_scanpy = adata.copy()
adata_scanpy.var['highly_variable'] = [i in scanpy_feats for i in adata_plot.var_names]
sc.pp.pca(adata_scanpy, use_highly_variable=True)
sc.pp.neighbors(adata_scanpy, n_neighbors=int(0.5 * len(adata_plot) ** 0.5), metric='cosine')
sc.tl.umap(adata_scanpy)

adata_triku = adata.copy()
adata_triku.var['highly_variable'] = [i in triku_feats for i in adata_plot.var_names]
tk.tl.triku(adata_triku, n_features=n_feats, n_windows=100, verbose='error')
sc.pp.pca(adata_triku, use_highly_variable=True)
sc.pp.neighbors(adata_triku, n_neighbors=int(0.5 * len(adata_plot) ** 0.5), metric='cosine')
sc.tl.umap(adata_triku)

In [None]:
common_genes = list(set(triku_feats) & set(scanpy_feats))
triku_only =list(set(triku_feats) - set(scanpy_feats))
scanpy_only = list(set(scanpy_feats) - set(triku_feats))
print(len(common_genes), len(triku_only), len(scanpy_only))

In [None]:
sc.pl.umap(adata_scanpy, color=['groupn'] + scanpy_only, legend_loc='on data', cmap=magma)

In [None]:
sc.pl.umap(adata_scanpy, color=['groupn'] + triku_only, legend_loc='on data', cmap=magma)

In [None]:
color = []
labels = []
alpha = [0.8 if i in triku_feats or i in scanpy_feats else 0.4 for i in adata.var_names]

for i in adata.var_names:
    if i in triku_feats and i in scanpy_feats:
        color.append("#000000")
        labels.append('Both')
    elif i in triku_feats:
        color.append("#900020")
        labels.append('Triku')
    elif i in scanpy_feats:
        color.append("#007ab7")
        labels.append('Scanpy')
    else:
        color.append("#bcbcbc")
        labels.append('None')


df_bokeh = pd.DataFrame({
    'm': np.log10(adata_triku.X.mean(0)),
    'z': (adata_triku.X == 0).sum(0) / adata_triku.shape[0],
    'n': adata_triku.var_names.values,
    'd': adata_triku.var["emd_distance"],
    'color': color, 'label':labels, 'alpha':alpha
    })[:]

In [None]:
p = figure(tools="box_zoom,hover,reset", plot_height=400, plot_width=400, tooltips=[("Gene","@n"), ('Value', '@d')])

p.scatter('m', 'd', source=df_bokeh,
          alpha=0.7, line_color=None,
         color='color', legend_group='label')

p.legend.location = 'top_left'
show(p)

In [None]:
def get_info_gene(adata, gene, group_col):
    mean_exp_val = []
    
    groups = sorted(list(set(adata.obs[group_col].values)))
    dict_argwhere = {group: np.argwhere(adata.obs[group_col].values == group) for group in groups}
    if spr.issparse(adata.X):
        exp_gene = np.asarray(adata[:,gene].X.todense()).ravel()
    else:
        exp_gene = adata[:,gene].X.ravel()
    
    for g in groups:
        exp_group = np.sort(exp_gene[dict_argwhere[g]].ravel())
        mean_exp_val.append(np.mean(exp_group[: int(0.90 * len(exp_group))])) # for genes with small expression it may amplify noise
        
    
    mean_exp_val_1 = np.array(mean_exp_val)/sum(mean_exp_val)
    
#     info = sts.entropy(mean_exp_val_1) / np.log(len(groups))
    info = max(np.sort(mean_exp_val_1)[3] - np.sort(mean_exp_val_1)[0], np.sort(mean_exp_val_1)[-1] - np.sort(mean_exp_val_1)[-4])
    
    return info, mean_exp_val_1
    


In [None]:
gene = 'Gene13823'

In [None]:
adata_deprob.X = np.asarray(adata_deprob.X.todense())

In [None]:
sc.pl.umap(adata_triku, color=['groupn', gene], legend_loc='on data', cmap=magma)
sc.pl.umap(adata_scanpy, color=['groupn', gene], legend_loc='on data', cmap=magma)

In [None]:
dict_df_deprob, dict_features = {}, {}
n_seeds, n_feats = 10, 5000
min_res, max_res, max_depth = 0.1, 2, 6

for deprob in [0.005, 0.0065, 0.008, 0.01, 0.013, 0.016, 0.025, 0.05, 0.1, 0.3]:
    adata = sc.read_loom(splatter_dir + f'/splatter_deprob_{deprob}.loom')
    adata.X = np.asarray(adata.X.todense())
    adata_groups = [i.replace('Group', '') for i in adata.obs['Group']]
    adata_df = pd.DataFrame(adata.X.T, index=adata.var_names, columns=adata.obs_names)  # adata.X is already dense here
    
    %Rpush adata_df n_feats
    
    df = pd.DataFrame(index = np.arange(n_seeds), columns=['triku', 'var', 'scanpy', 'scry', 'brennecke', 'M3Drop', 'NBUmi'])
    
    for seed in np.arange(n_seeds):
        if seed == 0:
            triku_feats = run_triku(adata, n_feats, seed)
            var_feats = run_variable(adata, n_feats)
            scanpy_feats = run_scanpy(adata, n_feats)
            NBumi_feats = %R run_NBumi(adata_df, n_feats)
            scry_feats = %R run_scry(adata_df, n_feats)
            brennecke_feats = %R run_brennecke(adata_df, n_feats)
            M3Drop_feats = %R run_M3Drop(adata_df, n_feats)
            
            dict_features[f'triku_{seed}'], dict_features['var'], dict_features['scanpy'] = triku_feats, var_feats, scanpy_feats
            dict_features['scry'], dict_features['brennecke'], dict_features['M3Drop'], dict_features['NBUmi'] = scry_feats, brennecke_feats, M3Drop_feats, NBumi_feats
        
        else:
            triku_feats = run_triku(adata, n_feats, seed)
            dict_features[f'triku_{seed}'] = triku_feats
            
        # Run clustering with each method and get the ARI against the true groups
        n_groups = len(set(adata_groups))
        feats_by_method = {'triku': triku_feats, 'var': var_feats, 'scanpy': scanpy_feats, 'scry': scry_feats,
                           'brennecke': brennecke_feats, 'M3Drop': M3Drop_feats, 'NBUmi': NBumi_feats}
        
        for method, feats in feats_by_method.items():
            # clustering_binary_search returns (cluster labels, resolution), as in the cells above
            c_method, _ = clustering_binary_search(adata, min_res, max_res, max_depth, seed, n_groups, feats)
            df.loc[seed, method] = ARS(c_method, adata_groups)  # ARS is adjusted_rand_score
        
        

In [None]:
%%R

aaa <- function(a, b, c){
    return(a+b+c)
}

In [None]:
a = 9
b = 19
c = 99

In [None]:
%%R -i a -i b -i c -o d

d = aaa(a,b,c)

In [None]:
d

In [None]:
%Rpush a b c
d = %R aaa(a,b,c)

In [None]:
d

In [None]:
%%R -i df -i n -o a

a = run_scry(df, n)

In [None]:
n = 1000

In [None]:
%Rpush df n
a = %R run_scry(df, n)

In [None]:
a

In [None]:
adata = sc.read_loom(splatter_dir + f'/splatter_deprob_{0.01}.loom')
adata.obs['groups'] = [i.replace('Group', '') for i in adata.obs['Group']]

In [None]:
adata.obs['Group']

In [None]:
adata = sc.datasets.pbmc3k()
sc.pp.filter_genes(adata, min_cells=10)
df = pd.DataFrame(np.asarray(adata.X.T.todense()), index=adata.var_names, columns=adata.obs_names)

In [None]:
%%R -i df -i df -o run_M3Drop

run_M3Drop <- run_NBumi(df, 1500)


In [None]:
%%R -i df -i adata -o scry_ret -o brennecke_ret -o M3Drop_ret -o NBumi_ret

scry_ret <- run_scry(adata, 1500)
brennecke_ret <- run_brennecke(adata, 1500)
M3Drop_ret <- run_M3Drop(df, 1500)
NBumi_ret <- run_NBumi(df, 1500) 

In [None]:
run_M3Drop

In [None]:
import sys, os
sys.path.insert(0, os.getcwd() + '/code')

# Selection of palettes for cluster coloring, and scatter values
from palettes_and_cmaps import magma, bold_and_vivid
from robustness_functions import run_batch, random_noise_parameter, plot_scatter_parameter, compare_parameter, plot_scatter_datasets


In [None]:
save_dir = os.getcwd() + '/exports/'

In [None]:
data_dir = os.path.dirname(os.getcwd()) + '/data/'

In [None]:
import os

In [None]:
save_dir = os.getcwd() + '/exports/'
read_dir = data_dir + 'Ding_2020/'

In [None]:
# To be able to read the data, the files must be named EXACTLY as follows:
# matrix.mtx.gz (you MUST rename it)
# features.tsv (you should rename it)
# barcodes.tsv (you should rename it)
from scipy.io import mmread

matrix = mmread('/media/seth/SETH_DATA/SETH_Alex/triku/data/Ding_2020/preprocessed/mouse/matrix.mtx.gz')
features = np.loadtxt('/media/seth/SETH_DATA/SETH_Alex/triku/data/Ding_2020/preprocessed/mouse/features.tsv', dtype=str)
barcodes = np.loadtxt('/media/seth/SETH_DATA/SETH_Alex/triku/data/Ding_2020/preprocessed/mouse/barcodes.tsv', dtype=str)

In [None]:
adata = sc.AnnData(X=matrix.tocsr()).transpose()
adata.var_names = features
adata.obs_names = barcodes

In [None]:
meta = pd.read_csv('/media/seth/SETH_DATA/SETH_Alex/triku/data/Ding_2020/preprocessed/mouse/meta_combined.txt', sep='\t', skiprows=[1])
adata = adata[meta['NAME'].values]
adata.obs['method'] = meta['Method'].values
adata.obs['CellType'] = meta['CellType'].values

In [None]:
adata

In [None]:
a = 1_000
a-1

In [None]:
meta

In [None]:
adata

In [None]:
matrix

In [None]:
adata = sc.read_text(read_dir + '/GSE133545_SMARTseq2_human_exp_mat.tsv').transpose()
adata.var_names_make_unique()
sc.pp.filter_genes(adata, min_cells=30)

In [None]:
tk.tl.triku(adata, verbose='triku', n_procs=4, knn=50)

In [None]:
from scipy.signal import fftconvolve

def func_a(a, b): return np.convolve(a, b)
def func_b(a, b): 
    x = fftconvolve(a, b)
    x[x < 0] = 0  # FFT round-off can produce tiny negative probabilities
    
    return x

def apply_convolution_read_counts(probs: np.ndarray, knn: int, func) -> (np.ndarray, np.ndarray):
    """
    Convolution of functions. The function convolves a probability distribution with itself knn times,
    using `func` (e.g. np.convolve). The result is an array of N elements (N arises as the convolution
    of a n-length array knn times) where the element i has the probability of i being observed.

    Parameters
    ----------
    probs : np.ndarray
        Probability distribution of read counts; element i is the probability of observing i reads.
    knn : int
        Number of kNN
    func : callable
        Function that convolves two 1D arrays, e.g. `func_a` or `func_b`.
    """
      
    
    # We are calculating the convolution of cells with positive expression. Thus, in the first distribution
    # we have to remove the cells with 0 reads, and rescale the probabilities.
    arr_0 = probs.copy()
    arr_0[0] = 0  # TODO: this will fail in log-transformed data
    arr_0 /= arr_0.sum()

    # We will use arr_base as the array with the read distribution
    arr_base = probs.copy()
    
    arr_convolve = func(arr_0, arr_base, )
    
    for knni in range(2, knn):
        arr_convolve = func(arr_convolve, arr_base, )

    # TODO: check the probability sum is 1 and, if so, remove
    arr_prob = arr_convolve / arr_convolve.sum()

    # TODO: if log transformed, this is untrue. Should not be arange.
    return np.arange(len(arr_prob)), arr_prob

In [None]:
counts_gene = adata.X[:, 377]
from tqdm.notebook import tqdm
import time

In [None]:
times_a, sums_a, times_b, sums_b = [], [], [], []

In [None]:
for i in tqdm(range(500, 1000)):
    counts_gene = adata.X[:, i]
    y_probs = np.bincount(counts_gene.astype(int)) / len(counts_gene)
    t = time.time()
    apply_convolution_read_counts(y_probs, 50, func_a)
    times_a.append(time.time() - t)
    sums_a.append(counts_gene.sum())
    
    
    t = time.time()
    apply_convolution_read_counts(y_probs, 50, func_b)
    times_b.append(time.time() - t)
    sums_b.append(counts_gene.sum())

In [None]:
fig = plt.figure(figsize=(15,8))
plt.scatter(np.log10(sums_a), np.log10(times_a))
plt.scatter(np.log10(sums_b), np.log10(times_b))

In [None]:
y_probs, len(y_probs)

In [None]:
a = apply_convolution_read_counts(y_probs, 8, func_a)

In [None]:
b = apply_convolution_read_counts(y_probs, 8, func_b)

In [None]:
np.sum(((a[1]-b[1])**2)**0.5)