# Comparing other feature selection methods with triku
In this notebook we will compare the performance of triku with that of other feature selection methods.

The methods that will be compared will be the following:
* Select genes with highest variance.   
* Scanpy's `sc.pp.highly_variable_genes`: It is based on Seurat's `vst` method, so they should return similar results.
* scry `devianceFeatureSelection()`. This is the feature selection method presented in Irizarry's GLM-PCA paper (https://doi.org/10.1186/s13059-019-1861-6). According to its description, it computes a deviance statistic for each row feature of count data, based on a multinomial null model that assumes each feature has a constant rate. Features with large deviance are likely to be informative. Uninformative, low-deviance features can be discarded to speed up downstream analyses and reduce the memory footprint. The `fam` parameter will be set to `binomial`, the default.
* M3Drop, which has two main functions:
    * M3Drop: the M3Drop model assumes that the proportion of zeros follows a Michaelis-Menten model. The Michaelis-Menten parameter $K$ is fitted and, for each gene, its parameter $K_i$ is compared to $K$ using a $Z$-test, which returns the selected genes.
    * NBumi: the procedure is similar to the one above, although the model fitted is now a negative binomial, and the selection of genes is again done using a $Z$-test.
* `BrenneckeGetVariableGenes` fits a function between CV$^2$ and mean expression. 

With the exception of scanpy and triku, the remaining methods are implemented in R. We will use Jupyter's `%%R` magic command and `anndata2ri` to convert `AnnData` objects into `SingleCellExperiment` objects, and we will define wrapper functions that accept that AnnData and return the list of selected features. These functions have to be defined in the notebook and cannot be externalized. 

M3Drop requires a normalization step, which will be done in-situ.

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
import triku as tk
import scanpy as sc
import pandas as pd
import numpy as np
import scipy.sparse as spr
import scipy.stats as sts
import os
import gc
from itertools import product
import pickle
import ray
import seaborn as sns
import itertools 

from IPython.display import display, HTML

from tqdm.notebook import tqdm

from bokeh.io import show, output_notebook, reset_output
from bokeh.plotting import figure
from bokeh.models import LinearColorMapper

import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib.lines import Line2D

from sklearn.metrics import adjusted_rand_score as ARS
from sklearn.metrics import adjusted_mutual_info_score as NMI
from sklearn.metrics import silhouette_score, davies_bouldin_score

reset_output()
output_notebook()

In [None]:
!python setup.py install

In [None]:
import sys, os
sys.path.insert(0, os.getcwd() + '/code')

from triku_nb_code.comparing_feat_sel import plot_max_var_x_dataset, plot_max_var_x_method, create_dict_UMAPs_datasets, \
get_max_diff_gene, plot_ARI_x_method, plot_ARI_x_dataset, biological_silhouette_ARI_table, plot_lab_org_comparison_scores, \
clustering_binary_search, compare_rankings, compare_values
from triku_nb_code.comparing_feat_sel import create_UMAP_adataset_libprep_org, plot_UMAPs_datasets, plot_XY, biological_silhouette_ARI_table
from triku_nb_code.palettes_and_cmaps import magma, bold_and_vivid, prism

In [None]:
%matplotlib inline

In [None]:
import anndata2ri
anndata2ri.activate()
%load_ext rpy2.ipython

In [None]:
%%R
# Load all the R libraries we will be using in the notebook
library(M3Drop) # Depends on r-foreign (conda-forge) and Hmisc and reldist (install.packages)
library(scry) # If R < 4, launch commit 9f0fc819

In [None]:
os.makedirs(os.getcwd() + '/exports/comparisons/', exist_ok=True)
os.makedirs(os.getcwd() + '/figures/comparison_figs/png', exist_ok=True)
os.makedirs(os.getcwd() + '/figures/comparison_figs/pdf', exist_ok=True)
data_dir = os.getcwd() + '/data/'

In the following 3 cells we will define the functions that obtain the most relevant features. Since some of the calls are to R, they have to be kept as separate cells. We also create the function `create_df_feature_ranking`, which creates two dataframes: one with the evaluation values (p-value, EMD, etc.) of each method, and a second one with the ranking of genes based on those values. These dataframes save us from calling the feature selection methods again each time we make a plot. `create_df_feature_ranking` is also kept as a cell because it makes some calls to R.
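
The ranking dataframe is built with a double `argsort`, which can be sketched on a toy example. The arrays below are made up for illustration and are not outputs of any of the methods. For scores where higher is better the first `argsort` is reversed, while for q-values (lower is better) it is used directly.

```python
import numpy as np

# Hypothetical per-gene values, for illustration only.
scores = np.array([0.2, 1.5, 0.7, 0.1])        # higher = more relevant
q_values = np.array([0.30, 0.01, 0.04, 0.90])  # lower = more relevant

# Higher-is-better columns: reverse the first argsort, then argsort again.
rank_scores = scores.argsort()[::-1].argsort()

# Lower-is-better columns (q-values): plain double argsort.
rank_q = q_values.argsort().argsort()

# Rank 0 is the best gene in both cases, so the top-n genes are rank < n.
top2 = np.flatnonzero(rank_scores < 2)

print(rank_scores, rank_q, top2)
```

With this convention every column of the ranking dataframe can be sorted in the same direction, regardless of how the underlying method scores genes.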

In [None]:
%%R

run_scry <- function(sce){ #adata
    adata_ret = devianceFeatureSelection(sce, nkeep=dim(sce)[1], assay='X')
    return(adata_ret) #returns adata with stats on .var
} 


run_brennecke <- function(sce){ #df
    res_df <- BrenneckeGetVariableGenes(sce, suppress.plot=TRUE, fdr=100)
    return(res_df) # returns sorted df with genes and stats
}


run_M3Drop <- function(sce){
    norm <- M3DropConvertData(sce, is.counts=TRUE)
    DE_genes <- M3DropFeatureSelection(norm, suppress.plot=TRUE, mt_threshold=50)
    return(DE_genes) # returns sorted df with genes and stats
    
}

run_NBumi <- function(sce){
    count_mat <- NBumiConvertData(sce, is.counts=TRUE)
    DANB_fit <- NBumiFitModel(count_mat)
    NBDropFS <- NBumiFeatureSelectionCombinedDrop(DANB_fit, suppress.plot=TRUE, qval.thresh=10)
    return(NBDropFS)  # returns sorted df with genes and stats
    
}


In [None]:
def run_scanpy(adata):
    adata_copy = adata.copy()
    if 'log1p' not in adata_copy.uns:
        sc.pp.log1p(adata_copy)
    ret = sc.pp.highly_variable_genes(adata_copy, n_top_genes=len(adata_copy), inplace=False)
    df = pd.DataFrame(ret)
    df =  df.set_index(adata_copy.var_names)
    del adata_copy; gc.collect()
    return df # returns df with stats

def run_variable(adata):
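    if spr.issparse(adata.X):
        # Sparse input: variance computed as E[x^2] - E[x]^2
        std = adata.X.power(2).mean(0) - np.power(adata.X.mean(0), 2) 
        std = np.asarray(std).flatten()        
    else:
        # Dense input: standard deviation, which ranks genes in the same order as the variance
        std = adata.X.std(0)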
        
    return std #returns vector with order as var_names 

def run_triku(adata, seed):
    adata_copy = adata.copy()
    tk.tl.triku(adata_copy, n_comps=30, n_windows=100, random_state=seed, verbose='error', n_procs=1)
    d = adata_copy.var['triku_distance'] #pd series with distance
    del adata_copy; gc.collect()
    return d

In [None]:
def create_df_feature_ranking(adatax, title_prefix, apply_log=False):
    """
    Create a dataframe with the ranking of features, and another one with the feature values. The adata must be the raw
    adata. From that we will create a adata_df necessary for some R methods.
    The adata will include:
    - Triku with 10 seeds 'triku_SEEDN'
    - Scanpy's HVG 'scanpy'
    - Std 'std'
    - scry 'scry'
    - brennecke 'brennecke'
    - M3Drop 'm3drop'
    - NBumi 'nbumi'
    
    After each method is run, we will fill the dataframe values, with the values of the metrics used for feature selection, 
    and the dataframe of rankings with the rankings based on the returned value (0, 1, 2, etc.). 
    We create two separate dataframes because the df with values might be reserved for other purposes. The rank dataframes is interesting
    because the values on the values dataframe have different argsort orders depending on the column (M3drop and NBumi direct, rest reverse).
    """
    
    adata = adatax.copy()
    sc.pp.filter_genes(adata, min_cells=1) 
    sc.pp.filter_cells(adata, min_genes=1)
    
    if apply_log:
        sc.pp.log1p(adata)
    
    adata_df = pd.DataFrame(adata.X.T, index=adata.var_names, columns=adata.obs_names)
        
    adata_short = sc.AnnData(X = adata.X[:,:]) # we have to create a clean adata because some column break Rpush
    adata_short.var_names, adata_short.obs_names = adata.var_names[:], adata.obs_names[:]
    print(adata_short.shape)
    
    %Rpush adata_short
    %Rpush adata_df
    
    print('Outside R', adata.shape, adata_short.shape)
    d = %R  dim(adata_short)
    print('Inside R', d)
       
    if 'Group' in adata.obs:
        adata_groups = [i.replace('Group', '') for i in adata.obs['Group']]
        adata.obs['groupn'] = adata_groups
    
    index, columns = adata.var_names, [f'triku_{i}' for i in range(10)] + ['scanpy', 'std', 'scry', 'brennecke', 'm3drop', 'nbumi']
    df_values, df_ranks = pd.DataFrame(index=index, columns=columns), pd.DataFrame(index=index, columns=columns)
    
    for i in range(10):
        df_emd_distance = run_triku(adata, i)
        df_values.loc[df_emd_distance.index, f'triku_{i}'] = df_emd_distance.values
        
    
    scanpy_ret = run_scanpy(adata)
    df_values.loc[scanpy_ret.index, 'scanpy'] = scanpy_ret['dispersions_norm'].values
    assert len(df_values.index) == len(adata.var_names)
    
    std_ret = run_variable(adata)
    df_values.loc[:, 'std'] = std_ret
    assert len(df_values.index) == len(adata.var_names)
    
    scry_ret = %R run_scry(adata_short)
    df_values.loc[scry_ret.var.index, 'scry'] = scry_ret.var['binomial_deviance'].values
    assert len(df_values.index) == len(adata.var_names)
    
    brennecke_ret = %R run_brennecke(adata_df)
    df_values.loc[brennecke_ret.index, 'brennecke'] = brennecke_ret['effect.size'].values
    assert len(df_values.index) == len(adata.var_names)
    
    M3Drop_ret = %R run_M3Drop(adata_df)
    df_values.loc[M3Drop_ret.index, 'm3drop'] = M3Drop_ret['q.value'].values
    assert len(df_values.index) == len(adata.var_names)
    
    NBumi_ret = %R run_NBumi(adata_df)
    df_values.loc[NBumi_ret.index, 'nbumi'] = NBumi_ret['q.value'].values
    assert len(df_values.index) == len(adata.var_names)
    
    # Now we will fill df_ranks with an argsort !!!!! M3DROP and NBumi is not [::-1] because they are q-values 
    for col in [f'triku_{i}' for i in range(10)] + ['scanpy', 'std', 'scry', 'brennecke']:
        df_ranks[col] = df_values[col].values.argsort()[::-1].argsort()
    for col in ['m3drop', 'nbumi']:
        df_ranks[col] = df_values[col].values.argsort().argsort() # double argsort to return the rank!
    
    df_ranks.to_csv(os.getcwd() + '/exports/comparisons/' + title_prefix + '_feature_ranks.csv')
    df_values.to_csv(os.getcwd() + '/exports/comparisons/' + title_prefix + '_feature_values.csv')
    print('df_ranks', df_ranks.shape)
    
    del adata; gc.collect()
    return df_values, df_ranks

# Random datasets
For this section we will use the random datasets generated with splatter.
To evaluate the performance of the feature selection methods, we will use two metrics, maximum deviation and ARI, explained below.

In [None]:
splatter_dir = os.getcwd() + '/data/splatter/'
list_deprobs = [0.0065, 0.008, 0.01, 0.016, 0.025, 0.05, 0.1, 0.3]

**THIS PROCESS TAKES ~ 4 HOURS!**

Also... this cell sometimes fails to load. Running it again makes it go fine. 

In [None]:
for deprob in tqdm(list_deprobs):
    print(f'Deprob {deprob}')
    adata_deprob = sc.read(splatter_dir + f'/splatter_deprob_{deprob}.loom', sparse=False)
    print(f'Adata {deprob} loaded: {adata_deprob.X.shape}')
    df_values, df_ranks = create_df_feature_ranking(adata_deprob, f'scatter_{deprob}')

## ARI / NMI
Using ARI on random datasets is a way to assess the effectiveness of the feature selection. Random datasets were prepared with different degrees of differentially expressed gene probability, so that we can compare the leiden clustering solution with the 9 populations. Triku can be run with different seeds, but the rest of the methods are deterministic. However, leiden clustering can be run with a seed in all cases. Therefore, we are going to run all processes with 10 seeds (although the deterministic processes will be run once).

To apply the ARI we need to run leiden with as many clusters as scatter populations. Since leiden runs on resolution, we need to adjust the resolution parameter to match the number of clusters. To do that we are going to implement a binary search-like algorithm. We will start with resolutions 0.3 and 2 (may change in the future). If any of those yields the clusters, done. Else, find the midpoint, run the clustering, and if the clustering yields the number of populations, stop. Else, set the upper or lower resolution to the one that makes the desired number of clusters to be in the middle. This algorithm will try at most 5 times (it gets to resolution differences of ~0.05, which is fair).
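
The search described above can be sketched as follows. This is a minimal illustration, not the `clustering_binary_search` helper used in this notebook: `count_clusters` is a hypothetical callable standing in for one leiden run, and it is assumed that the number of clusters does not decrease as the resolution grows.

```python
def resolution_binary_search(count_clusters, target, min_res=0.3, max_res=2.0, max_depth=5):
    """Look for a resolution at which count_clusters(resolution) == target.

    count_clusters is a hypothetical stand-in for a leiden run that returns
    the number of clusters; it is assumed to be non-decreasing in resolution.
    """
    # Try both endpoints first.
    for res in (min_res, max_res):
        if count_clusters(res) == target:
            return res

    lo, hi = min_res, max_res
    mid = (lo + hi) / 2
    for _ in range(max_depth):
        mid = (lo + hi) / 2
        n = count_clusters(mid)
        if n == target:
            return mid
        # Keep the half of the interval that still brackets the target.
        if n < target:
            lo = mid
        else:
            hi = mid
    return mid  # last resolution tried if the target was never hit


def toy_count(res):
    # Toy stand-in: the number of clusters grows stepwise with the resolution.
    return int(res * 5)


best = resolution_binary_search(toy_count, target=6)
print(best, toy_count(best))
```

With the default interval, five halvings narrow the resolution range to (2 - 0.3) / 2^5 ≈ 0.05, which matches the resolution difference mentioned above.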

To calculate the ARI, we need to load a dataset, select a number of features, and create the dataframe with seeds as rows (to see variation in clustering / triku) and the methods as columns. Because creating each dataframe takes time (there are 70 cells to be filled), we will choose two datasets (DE = 0.01 and 0.025) and two numbers of features (100 and 500), which show good results in the previous sections. 

In [None]:
save_dir = os.getcwd() + '/exports/comparisons/'

In [None]:
min_res, max_res, max_depth = 0.3, 2, 6

In [None]:
@ray.remote
def leiden_adata_NMI(deprobx):
    print(deprobx)
    adata_all = sc.read(splatter_dir + f'/splatter_deprob_{deprobx}.loom', sparse=False)
    sc.pp.subsample(adata_all, 0.4) # We shorten this to make the calculations not take 8 hours!
    sc.pp.filter_genes(adata_all, min_cells=1)
    sc.pp.filter_cells(adata_all, min_genes=1)
    sc.pp.log1p(adata_all)
    
    for n_features in [250, 500]:
        print(deprobx, n_features)
        if not os.path.exists(os.getcwd() + f'/exports/comparisons/NMI_scatter_{deprobx}_n_features_{n_features}.csv'):
            df_feature_ranks = pd.read_csv(os.getcwd() + '/exports/comparisons/' + f'scatter_{deprobx}' + '_feature_ranks.csv', index_col=0)

            list_methods = ['triku'] + [i for i in df_feature_ranks.columns if not i.startswith('triku')] + ['all', 'random']

            df_NMI = pd.DataFrame(index=[f'seed_{i}' for i in range(10)], columns=list_methods)

            for seed in range(10):
                print(deprobx, n_features, seed)
                for method in tqdm(list_methods):
                    if method.startswith('triku'):
                        feats = df_feature_ranks[f'triku_{seed}'].sort_values().index[:n_features]
                    
                    elif method == "all":
                        feats = df_feature_ranks[f'triku_{seed}'].sort_values().index[:]
                    elif method == "random":
                        array_selection = np.array([False] * len(df_feature_ranks))
                        array_selection[np.random.choice(np.arange(len(df_feature_ranks)), n_features, replace=False)] = True
                        
                        feats = df_feature_ranks[f'triku_{seed}'].sort_values().index[array_selection]
                    
                    else:
                        feats = df_feature_ranks[method].sort_values().index[:n_features]
                    
                    adata_groups = [i.replace('Group', '') for i in adata_all.obs['Group']]
                    c_f, res = clustering_binary_search(adata_all, min_res, max_res, max_depth, seed, len(list(dict.fromkeys(adata_groups))), feats, apply_log=False)
                    NMS = NMI(c_f, adata_groups)

                    df_NMI.loc[f'seed_{seed}', method] = NMS
            
            print(os.getcwd() + f'/exports/comparisons/NMI_scatter_{deprobx}_n_features_{n_features}.csv')
            df_NMI.to_csv(os.getcwd() + f'/exports/comparisons/NMI_scatter_{deprobx}_n_features_{n_features}.csv')

In [None]:
ray.init(num_cpus=min(os.cpu_count(), len(list_deprobs)), ignore_reinit_error=True)
ray_get = ray.get([leiden_adata_NMI.remote(deprobx) for deprobx in list_deprobs])
ray.shutdown()

In [None]:
help(plot_lab_org_comparison_scores)

#### Figure 3

In [None]:
for n_feats in ['250', '500']:
    list_files = [f'NMI_scatter_{deprob}_n_features_{n_feats}.csv' for deprob in list_deprobs[::-1]]
    plot_lab_org_comparison_scores(f'NMI-{n_feats}', '', save_dir, [''], increasing=0, mode='NMI', list_files=list_files, 
                                       title=f'NMI on artificial datasets, {n_feats} features', 
                                       filename=f'NMI_{n_feats}-features')

With the lower number of features (250), scanpy performs best at lower DE probabilities (up to 0.025) but performs worse at the highest ones (0.1 or 0.3), where scry is the best method, mainly because the features that separate the smaller clusters are the most highly expressed ones, and those are the ones selected by scry. However, at smaller DE probabilities, the features that best separate the dataset are the ones with mid expression levels, which are best picked by triku. 

In [None]:
df_values = pd.read_csv(f'{save_dir}/scatter_0.01_feature_values.csv', index_col=0)
df_ranks = pd.read_csv(f'{save_dir}/scatter_0.01_feature_ranks.csv', index_col=0)

In [None]:
df_ranks

In [None]:
df_values

In [None]:
adata = sc.read(splatter_dir + f'/splatter_deprob_0.01.loom', sparse=False)
sc.pp.subsample(adata, 0.4) # We shorten this to make the calculations not take 8 hours!

adata.raw = adata
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

In [None]:
df_bokeh = pd.DataFrame({'m': np.log10(adata.X.mean(0)), 
                         'z': df_values['triku_0'].loc[adata.var_names].values, 
                         'n': df_values.index.values})
                   
p = figure(tools="box_zoom,hover,reset", plot_height=600, plot_width=600, tooltips=[("Gene","@n")])
p.scatter('m', 'z', source=df_bokeh, alpha=0.7, line_color=None)
show(p)

#### Supplementary Figure 2

In [None]:
triku_hvg = df_ranks['triku_0'].values < 250
scry_hvg = df_ranks['scry'].values < 250
std_hvg = df_ranks['std'].values < 250
scanpy_hvg = df_ranks['scanpy'].values < 250

fig, axs = plt.subplots(1, 4, figsize=(12, 3))
names = ['triku', 'scanpy', 'std', 'scry',]
hvgs = [triku_hvg, scanpy_hvg, std_hvg, scry_hvg]
colors = ['#e73f74', '#7f3c8d', '#11a579', '#3969ac']

for i in range(4):
    axs[i].scatter(np.log10(adata.X.mean(0))[~hvgs[i]][::5], 
                   df_values['triku_0'].loc[adata.var_names].values[~hvgs[i]][::5], c="#dedede", s=2, alpha=0.7)
    axs[i].scatter(np.log10(adata.X.mean(0))[hvgs[i]], 
                   df_values['triku_0'].loc[adata.var_names].values[hvgs[i]], c=colors[i], s=2, label=names[i])
    axs[i].legend()

fig.text(0.0, 0.5, 'Wasserstein distance', va='center', rotation='vertical')
fig.text(0.5, 0.0, 'log$_{10}$ mean expression', va='center', rotation='horizontal')
plt.tight_layout()
plt.savefig(os.getcwd() + '/figures/comparison_figs/pdf/barplots_scatter.pdf', format='pdf')  # savefig takes `format`, not `fmt`


list_list_genes = [['Gene1118', 'Gene8599', 'Gene1513', 'Gene1479'],     # triku only
                   ['Gene6723', 'Gene6625', 'Gene9796', 'Gene935'],      # triku + scanpy
                   ['Gene12841', 'Gene10739', 'Gene6729', 'Gene12240'],  # scanpy
                   ['Gene9545', 'Gene4459', 'Gene383', 'Gene12455'],     # all
                   ['Gene1633', 'Gene10792', 'Gene2496', 'Gene12497']]   # std + scry

list_bar_colors = ['#94346E', '#E17C05', '#0F8554', '#1D6996']

for lg_idx, list_genes in enumerate(list_list_genes):
    fig, axs = plt.subplots(1, 5, figsize=(3*5, 3))
    
    hvg = np.isin(adata.var_names, list_genes)
    axs[0].scatter(np.log10(adata.X.mean(0))[~hvg][::5], 
                   df_values['triku_0'].loc[adata.var_names].values[~hvg][::5], c="#bcbcbc", s=2, alpha=0.7)
    
    for i in range(1, 5):
        for group in range(10):
            axs[0].scatter(np.log10(adata.X.mean(0))[np.isin(adata.var_names, list_genes[i - 1])][::5], 
                           df_values['triku_0'].loc[adata.var_names].values[np.isin(adata.var_names, list_genes[i - 1])][::5], c=list_bar_colors[i - 1], s=7)

                
            data_values = adata[adata.obs['Group'] == 'Group' + str(group + 1)].X[:, np.argwhere(adata.var_names == list_genes[i-1])[0]].flatten()
            mean, std = np.mean(data_values), np.std(data_values)

            axs[i].bar(group + 1, mean, color=list_bar_colors[i - 1])
            
    axs[0].set_ylabel('Wasserstein distance')
    axs[0].set_xlabel('log$_{10}$ mean expression')
    axs[1].set_ylabel('Mean group expression')
    axs[1].set_xlabel('Group')

    plt.tight_layout()

    plt.savefig(os.getcwd() + f'/figures/comparison_figs/pdf/barplots_{lg_idx}.pdf', format='pdf')  # savefig takes `format`, not `fmt`

# Ding et al. / Mereu et al. datasets
Now that we have seen that triku outperforms other methods in artificial datasets, at least when there is intrinsic noisiness, we are going to apply similar metrics to biological datasets. We will first use Mereu's and Ding's human + mouse benchmarking datasets. They will help us see performance biases across all the methods, and they will also act as a validation of the results from the original papers.

In this part, due to the large number of datasets and the heterogeneity of genes, we will not use different numbers of features. Instead, we will run triku with seed 0 and use the number of features it selects by default as the number of features to select with the rest of the methods. This means that different datasets will have different numbers of features, although each dataset will have the same number of features across methods. 

The two main methods that we will use to evaluate the feature selection are NMI and Silhouette scores.
* NMI uses the assigned cell types from the paper (Mereu et al. use matchSCore2 and Ding et al. use a custom algorithm) and applies the same binary search for resolution.
* Silhouette. It is used in two forms:
    * Apply the same resolution to all datasets and all methods using the binary search, and get the Silhouette from there.
    * Apply Silhouette to the benchmark-assigned cell types.
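
As a reminder of what the Silhouette score measures, here is a plain-numpy sketch of its definition on toy data. It is only an illustration under the assumptions of Euclidean distance and at least two clusters; the notebook imports scikit-learn's `silhouette_score`, and the function name `mean_silhouette` below is hypothetical.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Mean silhouette over all points (Euclidean distance, >= 2 clusters assumed)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum() - 1
        if n_same == 0:
            continue  # singleton cluster: silhouette is defined as 0
        # a: mean distance to the other points of the same cluster.
        a = dist[i, same].sum() / n_same
        # b: smallest mean distance to the points of any other cluster.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()


# Toy data: two tight pairs of points that are far apart.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = mean_silhouette(X, [0, 0, 1, 1])  # labels match the structure
bad = mean_silhouette(X, [0, 1, 0, 1])   # labels cut across the structure
print(good, bad)
```

Labels that follow the structure of the data give a score close to 1, whereas labels that cut across it give a negative score. This is why the score can be applied both to leiden clusters and to benchmark-assigned cell types.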

## Create feature ranking dataframes

**This process takes ~3 hours**

In [None]:
save_dir = os.getcwd() + '/exports/comparisons/'

In [None]:
mereu_dir = os.getcwd() + '/data/Mereu_2020/'

for libprep in tqdm(['10X', 'CELseq2', 'ddSEQ', 'Dropseq', 'inDrop', 'QUARTZseq', 'SingleNuclei', 'SMARTseq2']):
    for org in ['human', 'mouse']:
        if os.path.exists(save_dir + f'mereu_{libprep}_{org}-log_feature_values.csv'):
            print(f'{libprep}, {org} exists!')
        else:
            adata_libprep = sc.read(mereu_dir + f'{libprep}_{org}.h5ad')
            create_df_feature_ranking(adata_libprep, f'mereu_{libprep}_{org}-log', apply_log=True)

In [None]:
ding_dir = os.getcwd() + '/data/Ding_2020/'

for libprep in tqdm(['10X', 'CELseq2', 'Dropseq', 'inDrop', 'sci-RNA-seq', 'Seq-Well', 'SingleNuclei', 'SMARTseq2']):
    for org in ['human', 'mouse']:
        if os.path.exists(ding_dir + f'{libprep}_{org}.h5ad'):
            if os.path.exists(save_dir + f'ding_{libprep}_{org}-log_feature_values.csv'):
                print(f'{libprep}, {org} exists!')
            else:
                adata_libprep = sc.read(ding_dir + f'{libprep}_{org}.h5ad')
                create_df_feature_ranking(adata_libprep, f'ding_{libprep}_{org}-log', apply_log=True)

## Calculate scores

In [None]:
@ray.remote
def run_ARI_silhouette_rem(lib_prep, org, seed, lab, adata_dir, save_dir):
    if os.path.exists(adata_dir + f'{lib_prep}_{org}.h5ad'):
        # File name follows the file_root used below ({lab}_{lib_prep}_{org}-log)
        if os.path.exists(save_dir + f'{lab}_{lib_prep}_{org}-log_comparison-scores_seed-{seed}.csv'):
            print(f'{lib_prep}, {org}, {seed} exists!')
        else:
            adata = sc.read_h5ad(adata_dir + f'{lib_prep}_{org}.h5ad')
            print(adata)
            cell_type = 'cell_types' if 'cell_types' in adata.obs else 'CellType' # The cell type column name is not consistent across datasets
            df_rank = pd.read_csv(os.getcwd() + f'/exports/comparisons/{lab}_{lib_prep}_{org}-log_feature_ranks.csv', index_col=0)

            biological_silhouette_ARI_table(adata, df_rank, outdir=save_dir, file_root=f'{lab}_{lib_prep}_{org}-log', seed=seed, 
                                                        cell_types_col=cell_type, n_procs=1)   
    else:
        print(adata_dir + f'{lib_prep}_{org}.h5ad does not exist!')

In [None]:
# Mereu's datasets
save_dir = os.getcwd() + '/exports/comparisons/'
adata_dir = data_dir + 'Mereu_2020/'


lib_preps = ['SingleNuclei', 'Dropseq', 'inDrop', '10X', 'SMARTseq2', 'CELseq2', 'QUARTZseq'] 
orgs = ['mouse', 'human'] 
result = list(product(*[lib_preps, orgs, range(5)]))

ray.init(ignore_reinit_error=True, num_cpus=min(len(result), os.cpu_count()))

list_id = [run_ARI_silhouette_rem.remote(lib_prep, org, seed, 'mereu', adata_dir, save_dir) for lib_prep, org, seed in result]
list_results = ray.get(list_id)

ray.shutdown()

In [None]:
# Ding's datasets
save_dir = os.getcwd() + '/exports/comparisons/'
adata_dir = data_dir + 'Ding_2020/'


lib_preps = ['10X', 'CELseq2', 'Dropseq', 'inDrop', 'sci-RNAseq', 'Seq-Well', 'SingleNuclei', 'SMARTseq2']
orgs = ['mouse', 'human'] 
result = list(product(*[lib_preps, orgs, range(5)]))

ray.init(ignore_reinit_error=True, num_cpus=min(len(result), os.cpu_count()))

list_id = [run_ARI_silhouette_rem.remote(lib_prep, org, seed, 'ding', adata_dir, save_dir) for lib_prep, org, seed in result]
list_results = ray.get(list_id)

ray.shutdown()

#### Figure 4

In [None]:
for lab in ['mereu', 'ding']:
    save_dir = os.getcwd() + '/exports/comparisons/'
    plot_lab_org_comparison_scores(lab, org='-log', read_dir=save_dir, variables=['NMI'], figsize=(16, 4), title=f'NMI on {lab} datasets (log)', 
                                  filename=f'{lab}-NMI-log')

#### Figure 5

In [None]:
for lab in ['mereu', 'ding']:
    save_dir = os.getcwd() + '/exports/comparisons/'
    plot_lab_org_comparison_scores(lab, org='-log', read_dir=save_dir, variables=['Sil_bench_all_hvg'], figsize=(16, 4), 
                                       title=f'Silhouette on {lab} datasets, cell types on selected features (log)',
                                       filename=f'{lab}-silhouette_selected features_celltypes-log')

#### Supplementary Figure 1

In [None]:
for lab in ['mereu', 'ding']:
    save_dir = os.getcwd() + '/exports/comparisons/'
    plot_lab_org_comparison_scores(lab, org='-log', read_dir=save_dir, variables=['Sil_leiden_all_hvg'], figsize=(16, 4), 
                                   title=f'Silhouette on {lab} datasets, leiden clusters on selected features (log)', 
                                  filename=f'{lab}-silhouette_selected features_leiden-log')

## Explainability of results
Although we see that triku has promising results, we were struck by how large the gap in scores is between std, scry and brennecke on one side and triku, nbumi and m3drop on the other. In this section we are going to run some checks to see if we can find out why the difference is so big. 

### Overlap heatmaps

In [None]:
def plot_heatmaps_jaccard(df_ranks, n_HVG, fig_save_dir='', title='', ax=None, lab='ding'):
    df_heatmap = pd.DataFrame(np.nan, index=df_ranks.columns, columns=df_ranks.columns)  # np.NaN was removed in NumPy 2.0
    
    for row_idx, row in enumerate(df_ranks.columns):
        for col_idx, col in enumerate(df_ranks.columns):
            if row_idx >= col_idx:
                row_names = set(df_ranks.sort_values(by=row).index[:n_HVG].values)
                col_names = set(df_ranks.sort_values(by=col).index[:n_HVG].values)
                
                jaccard = len(row_names & col_names)/len(row_names | col_names)
                df_heatmap.loc[row, col] = jaccard
    
    h = sns.heatmap(df_heatmap, cbar=False, ax=ax, annot=True)
    h.set_title(title)
    
    plt.tight_layout()
    for fmt in ['png', 'pdf']:
        plt.savefig(f'{fig_save_dir}/{lab}_heatmap_overlap_features.{fmt}', bbox_inches='tight')
    

In [None]:
df_ranks = pd.read_csv(os.getcwd() + '/exports/comparisons/ding_10X_mouse-log_feature_ranks.csv', index_col=0)
df_ranks = df_ranks[['triku_0', 'm3drop', 'nbumi', 'scanpy', 'brennecke', 'scry', 'std',]].rename(columns={'triku_0':'triku'})
df_ranks

plot_heatmaps_jaccard(df_ranks, n_HVG=3000, fig_save_dir=os.getcwd() + '/figures/comparison_figs', 
                      title='SN')

#### Figure 6

In [None]:
fig, axs = plt.subplots(1, 4, figsize=(4.3*4, 4.3))

# First is an NMI barplot
palette = ["#E73F74", "#7F3C8D", "#11A579", "#3969AC", "#F2B701",
        "#80BA5A", "#E68310", "#a0a0a0", "#505050",]

list_files = ['ding_10X_mouse-log_comparison-scores', 'ding_Dropseq_human-log_comparison-scores',
              'ding_Seq-Well_human-log_comparison-scores', ]

methods = ['triku', 'scanpy', 'std', 'scry', 'brennecke', 'm3drop', 'nbumi', 'all', 'random']
for libprep_idx, libprep in enumerate(list_files):
    pre = '' if libprep_idx == 0 else '_'
    for method_idx, method in enumerate(methods):
        list_y = []
        for seed in range(5):
            df = pd.read_csv(os.getcwd() + f'/exports/comparisons/{libprep}_seed-{seed}.csv', index_col=0)
            list_y.append(df.loc['NMI', method])

        axs[0].bar(
            libprep_idx + (method_idx - len(methods) // 2) * 0.09,
            np.mean(list_y), width=0.09, yerr=np.std(list_y), color=palette[method_idx], 
            label=pre + method,
        )
        
axs[0].set_xticks([0, 1, 2])
axs[0].set_xticklabels(['10X mouse', 'Dropseq human', 'Seq-Well human'], rotation=45, ha='right')
axs[0].set_title('NMI on ding datasets')
axs[0].legend(ncol=2, handleheight=0.3, labelspacing=0.05, prop={'size': 8}, frameon=False)
axs[0].set(frame_on=False)

# Next are the heatmaps of gene overlap
df_ranks = pd.read_csv(os.getcwd() + '/exports/comparisons/ding_10X_mouse-log_feature_ranks.csv', index_col=0)
df_ranks = df_ranks[['triku_0', 'm3drop', 'nbumi', 'scanpy', 'brennecke', 'scry', 'std',]].rename(columns={'triku_0':'triku'})
plot_heatmaps_jaccard(df_ranks, n_HVG=3000, fig_save_dir=os.getcwd() + '/figures/comparison_figs', 
                      title='10X mouse', ax=axs[1])

df_ranks = pd.read_csv(os.getcwd() + '/exports/comparisons/ding_Dropseq_human-log_feature_ranks.csv', index_col=0)
df_ranks = df_ranks[['triku_0', 'm3drop', 'nbumi', 'scanpy', 'brennecke', 'scry', 'std',]].rename(columns={'triku_0':'triku'})
plot_heatmaps_jaccard(df_ranks, n_HVG=3000, fig_save_dir=os.getcwd() + '/figures/comparison_figs', 
                      title='Dropseq human', ax=axs[2])

df_ranks = pd.read_csv(os.getcwd() + '/exports/comparisons/ding_Seq-Well_human-log_feature_ranks.csv', index_col=0)
df_ranks = df_ranks[['triku_0', 'm3drop', 'nbumi', 'scanpy', 'brennecke', 'scry', 'std',]].rename(columns={'triku_0':'triku'})
plot_heatmaps_jaccard(df_ranks, n_HVG=3000, fig_save_dir=os.getcwd() + '/figures/comparison_figs', 
                      title='Seq-Well human', ax=axs[3])

# plt.tight_layout()

# Enrichment of ribosomal and mitochondrial genes

In [None]:
def barplot_mt_rbp(lab, org, method, n_features=[100, 250, 500, 1000], mode=0):
    list_FS = ['triku_0', 'scanpy', 'std', 'scry', 'brennecke', 'm3drop', 'nbumi']
    palette = ["#E73F74","#7F3C8D","#11A579","#3969AC","#F2B701","#80BA5A","#E68310","#a0a0a0","#505050"]
    
    fig, axs = plt.subplots(2, 1, figsize=(10, 6))
    
    for n_feature_idx, n_feature in enumerate(n_features):
        for FS_idx, FS in enumerate(list_FS):
            df = pd.read_csv(os.getcwd() + f'/exports/comparisons/{lab}_{method}_{org}-log_feature_ranks.csv', index_col=0)

            set_rbp = set([i for i in df.index if (i.upper().startswith('RPS')) | (i.upper().startswith('RPL'))])
            set_mt = set([i for i in df.index if (i.upper().startswith('MT-'))])
            
            set_FS = set(df.sort_values(by=FS).index.tolist()[:n_feature])
            
            if mode == 0:
                axs[0].bar(n_feature_idx + (FS_idx - len(list_FS) // 2) / (len(list_FS) + 3) , 100 * len(set_rbp & set_FS)/len(set_FS), 
                        width = 0.1, color=palette[FS_idx])
                axs[1].bar(n_feature_idx + (FS_idx - len(list_FS) // 2) / (len(list_FS) + 3) , 100 * len(set_mt & set_FS)/len(set_FS), 
                        width = 0.1, color=palette[FS_idx])
            else:
                axs[0].bar(n_feature_idx + (FS_idx - len(list_FS) // 2) / (len(list_FS) + 3) , 100 * len(set_rbp & set_FS)/len(set_rbp), 
                        width = 0.1, color=palette[FS_idx])
                axs[1].bar(n_feature_idx + (FS_idx - len(list_FS) // 2) / (len(list_FS) + 3) , 100 * len(set_mt & set_FS)/len(set_mt), 
                        width = 0.1, color=palette[FS_idx])
                
    for ax in axs:
        ax.set_xticks(range(len(n_features)))
        ax.set_xticklabels(n_features)
    
    if mode == 0:
        axs[0].set_ylabel('% ribosomal genes\n(from selected features)')
        axs[1].set_ylabel('% mitochondrial genes\n(from selected features)')
    else:
        axs[0].set_ylabel('% ribosomal genes\n(from all ribosomal genes)')
        axs[1].set_ylabel('% mitochondrial genes\n(from all mitochondrial genes)')
        
    legend_elements = [mpl.lines.Line2D([0], [0], marker="o", color=palette[0], label='triku')] + [
        mpl.lines.Line2D(
            [0], [0], marker="o", color=palette[j], label=list_FS[j]
        )
        for j in range(1, len(list_FS))
    ]
    axs[0].legend(handles=legend_elements, bbox_to_anchor=(1.2, 0.9))
    
    
def heatmap_mt_rbp(labs, orgs, methods, n_features=500):
    list_FS = ['triku_0', 'm3drop', 'nbumi', 'scanpy', 'std', 'scry', 'brennecke', ]  # column names in the ranks CSV
    palette = ["#E73F74","#80BA5A","#E68310","#7F3C8D","#11A579","#3969AC","#F2B701","#a0a0a0","#505050"]
        
    dict_info = {}
    
    for lab in labs:
        for org in orgs:
            for method in methods:
                for FS_idx, FS in enumerate(list_FS):
                    if not os.path.exists(os.getcwd() + f'/exports/comparisons/{lab}_{method}_{org}-log_feature_ranks.csv'):
                        continue
                        
                    df = pd.read_csv(os.getcwd() + f'/exports/comparisons/{lab}_{method}_{org}-log_feature_ranks.csv', index_col=0)

                    set_rbp = set([i for i in df.index if (i.upper().startswith('RPS')) | (i.upper().startswith('RPL'))])
                    set_mt = set([i for i in df.index if (i.upper().startswith('MT-'))])

                    set_FS = set(df.sort_values(by=FS).index.tolist()[:n_features])
                    
                    for opt in [f'{FS}_per_rbp_all_features', f'{FS}_per_mt_all_features',]:
                        if opt not in dict_info:
                            dict_info[opt] = []
                    
                    dict_info[f'{FS}_per_rbp_all_features'].append(100 * len(set_rbp & set_FS) / len(set_FS))
                    dict_info[f'{FS}_per_mt_all_features'].append(100 * len(set_mt & set_FS) / len(set_FS))
    
    df = pd.DataFrame(index=['triku'] + list_FS[1:], columns=[
        'Percentage RBPs in selected features', 'Percentage MTs in selected features', ])
    
    for FS_idx, FS in enumerate(list_FS):
        df.iloc[FS_idx, 0] = '%.3f' % np.nanmean(dict_info[f'{FS}_per_rbp_all_features'])
        df.iloc[FS_idx, 1] = '%.3f' % np.nanmean(dict_info[f'{FS}_per_mt_all_features'])
                                
                                     
    return df.astype(float)

#### Table 1

In [None]:
df = heatmap_mt_rbp(['mereu'], ['human', 'mouse'], ['SingleNuclei', 'Dropseq', 'inDrop', '10X', 'SMARTseq2', 
                                              'CELseq2', 'QUARTZseq', 'sci-RNAseq', 'Seq-Well'], n_features=1000)

display(df)

In [None]:
df = heatmap_mt_rbp(['ding'], ['human', 'mouse'], ['SingleNuclei', 'Dropseq', 'inDrop', '10X', 'SMARTseq2', 
                                              'CELseq2', 'QUARTZseq', 'sci-RNAseq', 'Seq-Well'], n_features=1000)

display(df)

# Gene ontology analysis

To compare the feature selection (FS) methods, one option is to run Enrichr on the features selected by each of them. If the ontologies returned for one method have lower adjusted p-values (or higher scores), its selected features are likely to be more representative of the biology of the dataset.
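
As a rough sketch of this comparison, the snippet below summarizes each method by the mean $-\log_{10}$ of its smallest adjusted p-values, which is the quantity plotted in this section when sorting by `'Adjusted P-value'`. The p-values, the method names and the helper `mean_neg_log10` are invented for illustration; real values would come from the Enrichr result tables.

```python
import math


def mean_neg_log10(adj_pvals, n_ontologies=3):
    """Mean of -log10(p) over the n_ontologies smallest adjusted p-values."""
    best = sorted(adj_pvals)[:n_ontologies]
    return sum(-math.log10(p) for p in best) / len(best)


# Invented adjusted p-values for two hypothetical feature selection methods.
adj_pvals_by_method = {
    'method_A': [1e-8, 1e-6, 1e-4, 0.2, 0.9],
    'method_B': [1e-3, 1e-2, 0.1, 0.5, 0.8],
}

scores = {name: mean_neg_log10(p) for name, p in adj_pvals_by_method.items()}
best_method = max(scores, key=scores.get)
print(scores, best_method)
```

A higher score means stronger enrichment among the top ontologies. This summary ignores how many genes overlap each term, so it is best read together with the other sorting options.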


In [None]:
import gseapy

In [None]:
os.makedirs(os.getcwd() + f'/exports/enrichr/', exist_ok=True)

In [None]:
list_onto_mouse = ['KEGG_2019_Mouse', 'WikiPathways_2019_Mouse', 'GO_Biological_Process_2018', 'GO_Cellular_Component_2018', 
                   'GO_Molecular_Function_2018',]

list_onto_human = ['KEGG_2019_Human', 'WikiPathways_2019_Human', 'GO_Biological_Process_2018', 'GO_Cellular_Component_2018', 
                   'GO_Molecular_Function_2018', ]

In [None]:
@ray.remote
def call_enrichr(lab, org, method, n_features, FS):
    # Results for 'triku_0' are saved as 'triku', so the output path is built with the
    # renamed label; otherwise the cache check below would never match for triku.
    FS_out = 'triku' if FS == 'triku_0' else FS
    out_file = os.getcwd() + f'/exports/enrichr/{lab}_{method}_{org}_{n_features}_{FS_out}.csv'
    ranks_file = os.getcwd() + f'/exports/comparisons/{lab}_{method}_{org}-log_feature_ranks.csv'
    
    if os.path.exists(out_file):
        print(out_file + ' EXISTS!')
        return None
    
    if not os.path.exists(ranks_file):
        return None
    
    df_file = pd.read_csv(ranks_file, index_col=0)
    
    list_genes = df_file.sort_values(by=FS).index.tolist()[:n_features]
    list_onto = list_onto_mouse if org == 'mouse' else list_onto_human
    
    # Enrichr is a web service and calls can fail intermittently: retry up to 5 times.
    for n_trial in range(1, 6):
        try:
            result_df = gseapy.enrichr(list_genes, list_onto, cutoff=1, organism=org).results
            result_df.to_csv(out_file, index=None)
            break
        except Exception:
            print(f'TRIAL {n_trial}')

In [None]:
list_comb = list(product(*[['ding', 'mereu'], 
                           ['human', 'mouse'], 
                           ['SingleNuclei', 'Dropseq', 'inDrop', '10X', 'SMARTseq2', 'CELseq2', 'QUARTZseq', 'sci-RNAseq', 'Seq-Well'], 
                           [100, 250, 500, 1000, 1250, 1500], 
                           ['triku_0', 'scanpy', 'std', 'scry', 'brennecke', 'm3drop', 'nbumi']]))


ray.init(ignore_reinit_error=True)

list_id = [call_enrichr.remote(lab, org, method, n_features, FS) for lab, org, method, n_features, FS in list_comb]
list_results = ray.get(list_id)

ray.shutdown()

In [None]:
def scatter_enrichr(lab, org, method, n_features, n_ontologies=30, column_sort='Adjusted P-value', plot_type='bar', 
                    list_onto=['KEGG_2019_Mouse', 'WikiPathways_2019_Mouse', 'KEGG_2019_Human', 'WikiPathways_2019_Human',
                               'GO_Biological_Process_2018', 'GO_Cellular_Component_2018', 'GO_Molecular_Function_2018',], save=''):
    list_FS = ['triku', 'scanpy', 'std', 'scry', 'brennecke', 'm3drop', 'nbumi']
    palette = ["#E73F74","#7F3C8D","#11A579","#3969AC","#F2B701","#80BA5A","#E68310","#a0a0a0","#505050"]
    
    fig, ax = plt.subplots(1, 1, figsize=(10, 3))
    
    dict_dfs = {}
    
    for n_feature_idx, n_feature in enumerate(n_features):
        for FS_idx, FS in enumerate(list_FS):
            df = pd.read_csv(os.getcwd() + f'/exports/enrichr/{lab}_{method}_{org}_{n_feature}_{FS}.csv')
            df = df[df['Gene_set'].isin(list_onto)]
            
            if column_sort == 'Adjusted P-value':
                df = df.sort_values(by=column_sort).iloc[:n_ontologies]
                y_vals = df[column_sort].values
                y_vals = - np.log10(y_vals)

            elif column_sort == 'Combined Score':
                df = df.sort_values(by=column_sort, ascending=False).iloc[:n_ontologies]
                y_vals = df[column_sort].values
            
            elif column_sort == 'division':
                table_vals = df['Overlap'].values
                df['divided'] = [int(i.split('/')[0]) / int(i.split('/')[1]) for i in table_vals]
                df = df.sort_values(by='divided', ascending=False).iloc[:n_ontologies]
                y_vals = df['divided'].values
                
            
            x_pos = n_feature_idx + (FS_idx - len(list_FS) // 2) / (len(list_FS) + 3)
            
            if plot_type == 'bar':
                plt.bar(x_pos , np.mean(y_vals), 
                        width = 0.1, yerr=np.std(y_vals), color=palette[FS_idx])
            elif plot_type == 'scatter':
                plt.scatter([x_pos] * len(y_vals), y_vals, c=palette[FS_idx], alpha=0.8)
            
            dict_dfs[f'{n_feature}_{FS}'] = df
    
    legend_elements = [
        mpl.lines.Line2D(
            [0], [0], marker="o", color=palette[j], label=list_FS[j]
        )
        for j in range(len(list_FS))
    ]
    ax.legend(handles=legend_elements, bbox_to_anchor=(1.2, 0.9))
    ax.set_xticks(range(len(n_features)))
    ax.set_xticklabels(n_features)
    ax.set_ylabel(column_sort)
    ax.set_xlabel('Number of features')
    
    plt.tight_layout()
    
    if save:
        plt.savefig(save)
    
    return dict_dfs
    

In [None]:
def barplot_ontologies_individual(df, axis=None, color="#ababab", column='Adjusted P-value', ascending=False, log=True, y_text=''):
    if axis is None:
        fig, axis = plt.subplots(1, 1, figsize=(10, 7))
    
    vals = df.sort_values(by=column, ascending=ascending)[column].values
    names = [i.split(' (')[0] for i in df.sort_values(by=column, ascending=ascending)['Term'].values]
    names = [i[: 42] + '...' if len(i) > 42 else i for i in names]

    if log:
        vals = - np.log10(df.sort_values(by=column, ascending=ascending)[column].values)
    
    if column == 'Adjusted P-value':
        if log:
            axis.plot(-np.log10([0.05, 0.05]), [-1.5, len(names) + 0.5], c="#ababab", alpha=0.8, linewidth=3, zorder=0)
        else:
            axis.plot([0.05, 0.05], [-1.5, len(names) + 0.5], c="#ababab", alpha=0.8, linewidth=3, zorder=0)
        
    axis.barh(range(len(df)), vals, color=color, zorder=5)
    
    for y in range(len(df)):
        axis.text(0.05 * np.max(axis.get_xlim()), y - 0.2, names[y], zorder=10, fontsize=12)
        
    axis.set_yticks([])
    axis.spines['right'].set_visible(False)
    axis.spines['top'].set_visible(False)

    x_text = column if not log else column + ' (log)'
    axis.set_xlabel(x_text)
    axis.set_ylabel(y_text)
    
    return axis


def barplot_ontologies_all(dict_dfs, n_features=1000, list_FSs=['triku', 'std', 'scry', 'scanpy', 'm3drop', 'nbumi'], 
                           list_colors=["#E73F74", "#11A579","#3969AC", "#7F3C8D", "#80BA5A","#E68310"], figsize=(17, 14), save=''):
    
    mpl.rcParams.update({'font.size':17})
    fig, axis = plt.subplots(2, 3, figsize=figsize)
    
    for i in range(len(list_FSs)):
        barplot_ontologies_individual(dict_dfs[f'{n_features}_{list_FSs[i]}'], axis=axis.ravel()[i], 
                                      color=list_colors[i], column='Adjusted P-value', ascending=False, log=True, y_text=list_FSs[i])
    
    plt.tight_layout()
    
    if save:
        plt.savefig(save)
        
    mpl.rcParams.update(mpl.rcParamsDefault)

In [None]:
enrichr_figs_dir = os.getcwd() + '/figures/enrichr_figs/'
os.makedirs(enrichr_figs_dir, exist_ok=True)

In [None]:
# Good in Ding: Dropseq Human (Immune cells)

lab, org, method, n_features = 'ding', 'human', 'Dropseq', [100, 250, 500, 1000, 1250, 1500]
list_dfs_ding_human_dropseq = []
for x in ['Adjusted P-value']:  # ['Combined Score', 'Adjusted P-value', 'division']:
    dict_df = scatter_enrichr(lab, org, method, n_features, n_ontologies=25, column_sort=x, plot_type='scatter', 
                    list_onto=[ 'GO_Biological_Process_2018',], save=enrichr_figs_dir + f'scatter_{lab}_{org}_{method}_{x}.pdf')
    list_dfs_ding_human_dropseq.append(dict_df)
    
    

In [None]:
barplot_ontologies_all(list_dfs_ding_human_dropseq[0], save=enrichr_figs_dir + f'barplots_{lab}_{org}_{method}_{x}.pdf')

In [None]:
# Good in Mereu: Dropseq Mouse (Colon cells)

lab, org, method, n_features = 'mereu', 'mouse', 'Dropseq', [100, 250, 500, 1000, 1250, 1500]
list_dfs_mereu_mouse_dropseq = []
for x in ['Adjusted P-value']:  # ['Combined Score', 'Adjusted P-value', 'division']:
    dict_df = scatter_enrichr(lab, org, method, n_features, n_ontologies=25, column_sort=x, plot_type='scatter', 
                    list_onto=['GO_Biological_Process_2018',], save=enrichr_figs_dir + f'scatter_{lab}_{org}_{method}_{x}.pdf')
    list_dfs_mereu_mouse_dropseq.append(dict_df)

In [None]:
barplot_ontologies_all(list_dfs_mereu_mouse_dropseq[0], save=enrichr_figs_dir + f'barplots_{lab}_{org}_{method}_{x}.pdf')

In [None]:
# Bad in Ding: 10X Human (Immune cells)

lab, org, method, n_features = 'ding', 'human', '10X', [100, 250, 500, 1000, 1250, 1500]
list_dfs_ding_human_10x = []
for x in ['Adjusted P-value']:  # ['Combined Score', 'Adjusted P-value', 'division']:
    dict_df = scatter_enrichr(lab, org, method, n_features, n_ontologies=25, column_sort=x, plot_type='scatter', 
                    list_onto=['GO_Biological_Process_2018',], save=enrichr_figs_dir + f'scatter_{lab}_{org}_{method}_{x}.pdf')
    list_dfs_ding_human_10x.append(dict_df)

In [None]:
barplot_ontologies_all(list_dfs_ding_human_10x[0], save=enrichr_figs_dir + f'barplots_{lab}_{org}_{method}_{x}.pdf')

In [None]:
# Bad in Mereu: 10X Mouse (Colon cells)

lab, org, method, n_features = 'mereu', 'mouse', '10X', [100, 250, 500, 1000, 1250, 1500]
list_dfs_mereu_mouse_10x = []
for x in ['Adjusted P-value']:  # ['Combined Score', 'Adjusted P-value', 'division']:
    dict_df = scatter_enrichr(lab, org, method, n_features, n_ontologies=25, column_sort=x, plot_type='scatter', 
                    list_onto=['GO_Biological_Process_2018',], save=enrichr_figs_dir + f'scatter_{lab}_{org}_{method}_{x}.pdf')
    list_dfs_mereu_mouse_10x.append(dict_df)

In [None]:
barplot_ontologies_all(list_dfs_mereu_mouse_10x[0], save=enrichr_figs_dir + f'barplots_{lab}_{org}_{method}_{x}.pdf')

# 10X datasets
In this section we analyze the 10x neuron, heart and PBMC datasets.
For each feature selection method, we show the Silhouette scores of the leiden clusters obtained with its selected features.
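
As a reminder of what the Silhouette score measures, here is a minimal pure-Python sketch on toy one-dimensional data. It is not the implementation used by `biological_silhouette_ARI_table` (defined outside this section); the function `silhouette` and the points are made up for illustration, and the sketch assumes at least two clusters and points that are not all identical.

```python
def silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points with absolute-difference distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances from point i to the members of each cluster (excluding itself).
        dists = {}
        for j, (q, lab_j) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab_j, []).append(abs(p - q))
        if lab not in dists:  # singleton cluster: coefficient defined as 0
            scores.append(0.0)
            continue
        a = sum(dists[lab]) / len(dists[lab])  # mean intra-cluster distance
        # Mean distance to the nearest other cluster.
        b = min(sum(d) / len(d) for other, d in dists.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette(points, [0, 0, 1, 1])  # clusters match the two groups
bad = silhouette(points, [0, 1, 0, 1])   # clusters mix the two groups
print(good, bad)
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores indicate that points sit closer to another cluster than to their own.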

In [None]:
_10x_dir = os.path.dirname(os.getcwd()) + '/data/10x/'
save_dir = os.getcwd() + '/exports/comparisons/'

In [None]:
for libprep in tqdm(['neuron', 'pbmc', 'heart', ]):
    if os.path.exists(save_dir + f'10X-datasets_nonlog_{libprep}_feature_values.csv'):
        print(f'{libprep}, exists!')
    else:
        adata_libprep = sc.read_10x_h5(_10x_dir + f'{libprep}_10k_v3_raw_feature_bc_matrix.h5')
        adata_libprep.var_names_make_unique()
        sc.pp.filter_cells(adata_libprep, min_counts=400)
        sc.pp.filter_genes(adata_libprep, min_counts=100)
        adata_libprep.var_names_make_unique()
        adata_libprep.X = np.asarray(adata_libprep.X.todense())
        del [adata_libprep.var, adata_libprep.obs]

        print(adata_libprep.X)
        create_df_feature_ranking(adata_libprep, f'10X-datasets_nonlog_{libprep}')

In [None]:
save_dir = os.getcwd() + '/exports/comparisons/'

tissues = ['heart', 'pbmc', 'neuron']

for tissue in tissues:
    print(tissue)
    adata_libprep = sc.read_10x_h5(_10x_dir + f'{tissue}_10k_v3_raw_feature_bc_matrix.h5')
    adata_libprep.var_names_make_unique()
    sc.pp.filter_cells(adata_libprep, min_counts=400)
    sc.pp.filter_genes(adata_libprep, min_counts=100)
    adata_libprep.var_names_make_unique()
    adata_libprep.X = np.asarray(adata_libprep.X.todense())
    df_rank = pd.read_csv(os.getcwd() + f'/exports/comparisons/10X-datasets_nonlog_{tissue}_feature_ranks.csv', index_col=0)
    
    for seed in range(5):
        biological_silhouette_ARI_table(adata_libprep, df_rank, outdir=save_dir, file_root=f'10X-datasets_nonlog_{tissue}', seed=seed, 
                                                            cell_types_col=None, n_procs=1)



In [None]:
for libprep in tqdm(tissues):
    if os.path.exists(save_dir + f'10X-datasets_log_{libprep}_feature_values.csv'):
        print(f'{libprep}, exists!')
    else:
        adata_libprep = sc.read_10x_h5(_10x_dir + f'{libprep}_10k_v3_raw_feature_bc_matrix.h5')
        adata_libprep.var_names_make_unique()
        sc.pp.filter_cells(adata_libprep, min_counts=400)
        sc.pp.filter_genes(adata_libprep, min_counts=100)
        adata_libprep.var_names_make_unique()
        adata_libprep.X = np.asarray(adata_libprep.X.todense())
        del [adata_libprep.var, adata_libprep.obs]
        
        sc.pp.log1p(adata_libprep)
        
        print(adata_libprep.X)
        create_df_feature_ranking(adata_libprep, f'10X-datasets_log_{libprep}')

In [None]:
save_dir = os.getcwd() + '/exports/comparisons/'

tissues = ['heart', 'pbmc', 'neuron']

for tissue in tissues:
    print(tissue)
    adata_libprep = sc.read_10x_h5(_10x_dir + f'{tissue}_10k_v3_raw_feature_bc_matrix.h5')
    adata_libprep.var_names_make_unique()
    sc.pp.filter_cells(adata_libprep, min_counts=400)
    sc.pp.filter_genes(adata_libprep, min_counts=100)
    adata_libprep.var_names_make_unique()
    adata_libprep.X = np.asarray(adata_libprep.X.todense())
        
    sc.pp.log1p(adata_libprep)

    df_rank = pd.read_csv(os.getcwd() + f'/exports/comparisons/10X-datasets_log_{tissue}_feature_ranks.csv', index_col=0)
    
    for seed in range(5):
        biological_silhouette_ARI_table(adata_libprep, df_rank, outdir=save_dir, file_root=f'10X-datasets_log_{tissue}', seed=seed, 
                                                            cell_types_col=None, n_procs=1)



In [None]:
lab = '10X-datasets'

plot_lab_org_comparison_scores(lab=lab, org='-', read_dir=save_dir, variables=['Sil_leiden_PCA'], figsize=(14, 4), 
                                   title=f'Silhouette on {lab} datasets, leiden clusters on PCA projection', 
                                  filename=f'{lab}-Silhouette_PCA_leiden')

plot_lab_org_comparison_scores(lab=lab, org='-', read_dir=save_dir, variables=['Sil_leiden_all_hvg'], figsize=(14, 4), 
                                   title=f'Silhouette on {lab} datasets, leiden clusters on selected features', 
                                  filename=f'{lab}-Silhouette_selected features_celltypes')

In [None]:
os.getcwd()

In [None]:
from bokeh.io import show, output_notebook, reset_output
from bokeh.plotting import figure
from bokeh.models import Circle, ColumnDataSource, Div, Grid, Line, LinearAxis, Plot, Range1d, Legend, LinearColorMapper, BasicTicker, PrintfTickFormatter, ColorBar

reset_output()
output_notebook()

In [None]:
mpl.rc('font', **{'size': 13})

In [None]:
def return_mean_per(matrix):
    # Returns the mean counts per gene, and the proportion of zeros
    n_reads_per_gene = matrix.sum(0).astype(int)
    n_zeros = (matrix == 0).sum(0)

    return n_reads_per_gene/matrix.shape[0], n_zeros/matrix.shape[0]

In [None]:
export_dir = os.getcwd() + '/exports/comparisons/'
adata_dir = os.getcwd().replace('notebooks', '') + '/data/Ding_2020/'

In [None]:
org, dataset, res = 'human', 'SMARTseq2', 1

In [None]:
rank = f'ding_{dataset}_{org}-log_feature_ranks.csv'
values = f'ding_{dataset}_{org}-log_feature_values.csv'
adata_name = f'{dataset}_{org}.h5ad'

In [None]:
df_ranks = pd.read_csv(export_dir + rank, index_col=0)
df_values = pd.read_csv(export_dir + values, index_col=0)
adata = sc.read_h5ad(adata_dir + adata_name)
adata = adata[:, df_ranks.index]

In [None]:
adata = sc.read_h5ad(adata_dir + adata_name)
adata = adata[:, df_ranks.index]
sc.pp.log1p(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=res)

In [None]:
plt.scatter(df_ranks['triku_0'], df_values['triku_0'])

In [None]:
list_mean_exp, list_p_zeros = return_mean_per(adata.X)
list_genes = adata.var_names

col_triku, col_rest = 'triku_0', 'm3drop'
colors = ['#000000', '#007ab7', '#b7007a', '#bcbcbc']
alphas = [0.9, 0.7, 0.7, 0.3]
cutoff = 1750

fs_triku = df_ranks[df_ranks[col_triku] < cutoff].index
fs_rest = df_ranks[df_ranks[col_rest] < cutoff].index

idx_cols = []
for var in list_genes:
    if (var in fs_triku) and (var in fs_rest):
        idx_cols.append(0)
    elif (var in fs_triku) and (var not in fs_rest):
        idx_cols.append(1)
    elif (var not in fs_triku) and (var in fs_rest):
        idx_cols.append(2)
    else:
        idx_cols.append(3)
idx_cols = np.array(idx_cols)        

df_bokeh = pd.DataFrame({'m': np.log10(list_mean_exp), 
                         'z': list_p_zeros,
                         'n': list_genes, 
                         'r_t': df_ranks[col_triku].values, 
                         'd_t': df_values[col_triku].values,
                         'd_n': df_values[col_rest].values,
                         'r_n': df_ranks[col_rest].values, 
                         'color': [colors[i] for i in idx_cols], 
                         'alphas': [alphas[i] for i in idx_cols]})

df_bokeh = df_bokeh.dropna(how='any')

In [None]:
col_color = 'd_t'
p = figure(tools="box_zoom,hover,reset", plot_height=600, plot_width=600, 
           tooltips=[("Gene","@n"), 
                     ('Rank | Value TRIKU', f'@r_t | @d_t'), 
                     ('Rank | Value OTHER', f'@r_n | @d_n'), ])

color_map = LinearColorMapper(low=min(df_bokeh[col_color].values), 
                              high=np.percentile(df_bokeh[col_color].values, 95), 
                              palette='Viridis256')



p.scatter('m', 'z', source=df_bokeh, fill_color={'field': col_color, 
                                                 'transform':color_map}, 
          alpha=0.7, line_color=None)

from bokeh.models import ColorBar
bar = ColorBar(color_mapper=color_map, location=(0,0))
p.add_layout(bar, "left")
show(p)

In [None]:
col_color = 'd_t'
p = figure(tools="box_zoom,hover,reset", plot_height=600, plot_width=600, 
           tooltips=[("Gene","@n"), 
                     ('Rank | Value TRIKU', f'@r_t | @d_t'), 
                     ('Rank | Value OTHER', f'@r_n | @d_n'), ])

print(f'''TRIKU + REST: {len(idx_cols[idx_cols == 0])}
        TRIKU: {len(idx_cols[idx_cols == 1])}
        REST: {len(idx_cols[idx_cols == 2])}
        NONE: {len(idx_cols[idx_cols == 3])}''')
p.scatter('m', 'z', source=df_bokeh, fill_color='color', alpha='alphas', line_color=None)

show(p)

In [None]:
# sc.pl.umap(adata, color=['leiden', 'Gm39043'], cmap=magma, legend_loc='on data')

In [None]:
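def get_norm_exp_cluster(adata, gene):
    # Fraction of the gene's total expression found in each leiden cluster, sorted from highest to lowest.
    # A float copy is used so that the normalization cannot alter adata.X and also works for integer counts.
    expression_vals = np.array(adata[:, gene].X, dtype=float).ravel()
    expression_vals = expression_vals / np.sum(expression_vals)
    exp_by_cluster = sorted([sum(expression_vals[adata.obs['leiden'] == str(i)]) for i 
                      in range(np.max(adata.obs['leiden'].astype(int)) + 1)])[::-1]
    return exp_by_cluster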

In [None]:
fig1, axs1 = plt.subplots(2, 2)
fig, axs = plt.subplots(1, 1)

names = [f'triku + {col_rest}', 'triku', col_rest, 'None']
for idx in range(4):
    list_clust = []
    list_genes = adata.var_names[idx_cols == idx]
    for gene in tqdm(sorted(list_genes[: min(2000, len(list_genes))])):
        exp_clust = get_norm_exp_cluster(adata, gene)
        list_clust.append(exp_clust)
    
    arr = np.array(list_clust)
    axs1.ravel()[idx].plot(np.arange(len(exp_clust)), 
                          100 * np.mean(arr, 0), color=colors[idx], alpha=1)
    axs1.ravel()[idx].fill_between(np.arange(len(exp_clust)),
                                  100 * np.percentile(arr, 85, axis=0), 
                                  100 * np.percentile(arr, 15, axis=0), 
                                  color=colors[idx], alpha=0.3)
    
    axs1.ravel()[idx].set_ylim([0, 80])  # values are percentages (100 * fraction)
    
    axs.plot(np.arange(len(exp_clust)), 100 * np.mean(arr, 0), color=colors[idx], alpha=1, label=names[idx])
    
axs.set_xticks(np.arange(len(exp_clust)))
axs.set_xticklabels(np.arange(len(exp_clust)))

axs.set_xlabel('Cluster (most to least expressed)')
axs.set_ylabel('% of expression')

plt.legend()

for f in [fig, fig1]:
    f.tight_layout()
    
fig.savefig(os.getcwd() + f'/figures/comparison_figs/pdf/clusters_{org}_{dataset}_res-{res}_method-{col_rest}_all.pdf')
fig1.savefig(os.getcwd() + f'/figures/comparison_figs/pdf/clusters_{org}_{dataset}_res-{res}_method-{col_rest}_individual.pdf')

In [None]:
for org in ['human', 'mouse']:
    for dataset in ['SMARTseq2', '10X']:
        for col_rest in ['m3drop', 'nbumi']:
            print(org, dataset, col_rest)
            res = 1
            
            rank = f'ding_{dataset}_{org}-log_feature_ranks.csv'
            values = f'ding_{dataset}_{org}-log_feature_values.csv'
            adata_name = f'{dataset}_{org}.h5ad'
            
            df_ranks = pd.read_csv(export_dir + rank, index_col=0)
            df_values = pd.read_csv(export_dir + values, index_col=0)
            adata = sc.read_h5ad(adata_dir + adata_name)
            adata = adata[:, df_ranks.index]
            sc.pp.log1p(adata)
            sc.pp.pca(adata)
            sc.pp.neighbors(adata)
            sc.tl.umap(adata)
            sc.tl.leiden(adata, resolution=res)
            
            
            list_mean_exp, list_p_zeros = return_mean_per(adata.X)
            list_genes = adata.var_names

            col_triku = 'triku_0'
            colors = ['#000000', '#007ab7', '#b7007a', '#bcbcbc']
            alphas = [0.9, 0.7, 0.7, 0.3]
            cutoff = 1750

            fs_triku = df_ranks[df_ranks[col_triku] < cutoff].index
            fs_rest = df_ranks[df_ranks[col_rest] < cutoff].index

            idx_cols = []
            for var in list_genes:
                if (var in fs_triku) and (var in fs_rest):
                    idx_cols.append(0)
                elif (var in fs_triku) and (var not in fs_rest):
                    idx_cols.append(1)
                elif (var not in fs_triku) and (var in fs_rest):
                    idx_cols.append(2)
                else:
                    idx_cols.append(3)
            idx_cols = np.array(idx_cols)        

            df_bokeh = pd.DataFrame({'m': np.log10(list_mean_exp), 
                                     'z': list_p_zeros,
                                     'n': list_genes, 
                                     'r_t': df_ranks[col_triku].values, 
                                     'd_t': df_values[col_triku].values,
                                     'd_n': df_values[col_rest].values,
                                     'r_n': df_ranks[col_rest].values, 
                                     'color': [colors[i] for i in idx_cols], 
                                     'alphas': [alphas[i] for i in idx_cols]})

            df_bokeh = df_bokeh.dropna(how='any')
            
            
            
            
            
            fig1, axs1 = plt.subplots(2, 2)
            fig, axs = plt.subplots(1, 1)

            names = [f'triku + {col_rest}', 'triku', col_rest, 'None']
            for idx in range(4):
                list_clust = []
                list_genes = adata.var_names[idx_cols == idx]
                for gene in tqdm(sorted(list_genes[: min(2000, len(list_genes))])):
                    exp_clust = get_norm_exp_cluster(adata, gene)
                    list_clust.append(exp_clust)

                arr = np.array(list_clust)
                axs1.ravel()[idx].plot(np.arange(len(exp_clust)), 
                                      100 * np.mean(arr, 0), color=colors[idx], alpha=1)
                axs1.ravel()[idx].fill_between(np.arange(len(exp_clust)),
                                              100 * np.percentile(arr, 85, axis=0), 
                                              100 * np.percentile(arr, 15, axis=0), 
                                              color=colors[idx], alpha=0.3)

                axs1.ravel()[idx].set_ylim([0, 80])

                axs.plot(np.arange(len(exp_clust)), 100 * np.mean(arr, 0), color=colors[idx], alpha=1, label=names[idx])

            axs.set_xticks(np.arange(len(exp_clust)))
            axs.set_xticklabels(np.arange(len(exp_clust)))

            axs.set_xlabel('Cluster (most to least expressed)')
            axs.set_ylabel('% of expression')
                
            plt.legend()

            for f in [fig, fig1]:
                f.tight_layout()
            
            print(os.getcwd() + f'/figures/comparison_figs/pdf/clusters_{org}_{dataset}_res-{res}_method-{col_rest}_all.pdf')
            fig.savefig(os.getcwd() + f'/figures/comparison_figs/pdf/clusters_{org}_{dataset}_res-{res}_method-{col_rest}_all.pdf')
            fig1.savefig(os.getcwd() + f'/figures/comparison_figs/pdf/clusters_{org}_{dataset}_res-{res}_method-{col_rest}_individual.pdf')

In [None]:
export_dir

In [None]:
len(adata.var_names)

In [None]:
len(df_ranks)

In [None]:
for lab in ['mereu', 'ding']:
    if lab == 'mereu':
        adata_dir = os.getcwd().replace('notebooks', '') + '/data/Mereu_2020/'
    elif lab == 'ding':
        adata_dir = os.getcwd().replace('notebooks', '') + '/data/Ding_2020/'
        
    for org in ['mouse', 'human']:
        for dataset in ['Dropseq', 'CELseq2', 'inDrop', 'QUARTZseq', 'SingleNuclei', 'ddSEQ', 'SMARTseq2', '10X',]:      
            print(lab, org, dataset)
            
            fig, axs = plt.subplots(1, 1)
            FS_names = ['triku', 'scanpy', 'std', 'scry', 'brennecke', 'm3drop', 'nbumi', 'all', 'random']

            colors = ["#E73F74","#7F3C8D","#11A579", "#3969AC","#F2B701","#80BA5A","#E68310","#a0a0a0","#505050",]
            cutoff, res = 1750, 1.2

            rank = f'{lab}_{dataset}_{org}-log_feature_ranks.csv'
            values = f'{lab}_{dataset}_{org}-log_feature_values.csv'
            adata_name = f'{dataset}_{org}.h5ad'

            if not (os.path.exists(adata_dir + adata_name) and os.path.exists(export_dir + rank)):
                print(adata_dir + adata_name, os.path.exists(adata_dir + adata_name))
                print(export_dir + rank, os.path.exists(export_dir + rank))
                print('Combo does not exist!')
                continue
            
            df_ranks = pd.read_csv(export_dir + rank, index_col=0)
            adata = sc.read_h5ad(adata_dir + adata_name)
            
            combined_names = np.intersect1d(df_ranks.index.values, adata.var_names)
            adata = adata[:, combined_names]
            df_ranks = df_ranks.loc[combined_names]

            sc.pp.log1p(adata)
            sc.pp.pca(adata)
            sc.pp.neighbors(adata)
            sc.tl.umap(adata)
            sc.tl.leiden(adata, resolution=res)

            for col_rest_idx, col_rest in enumerate(FS_names):
                print(col_rest)
                list_mean_exp, list_p_zeros = return_mean_per(adata.X)
                list_genes = adata.var_names

                if col_rest not in ['all', 'random', 'triku']:
                    list_genes = df_ranks[df_ranks[col_rest] < cutoff].index
                elif col_rest == 'all':
                    list_genes = df_ranks.index
                elif col_rest == 'random':
                    list_genes = np.random.choice(df_ranks.index, min(cutoff, len(df_ranks)), replace=False)  # no repeated genes
                elif col_rest == 'triku':
                    list_genes = df_ranks[df_ranks['triku_0'] < cutoff].index

                list_clust = []
                for gene in list_genes:
                    exp_clust = get_norm_exp_cluster(adata, gene)
                    list_clust.append(exp_clust)

                arr = np.array(list_clust)

                axs.plot(np.arange(len(exp_clust)), 100 * np.mean(arr, 0), color=colors[col_rest_idx], alpha=1, 
                         label=col_rest, zorder=len(FS_names) - col_rest_idx)

            axs.set_xticks(np.arange(len(exp_clust)))
            axs.set_xticklabels(np.arange(len(exp_clust)))

            axs.set_xlabel('Cluster (most to least expressed)')
            axs.set_ylabel('% of expression')

            plt.suptitle(f'{org} {dataset}')
            plt.legend()


            print(os.getcwd() + f'/figures/comparison_figs/pdf/clusters_{lab}_{org}_{dataset}_res-{res}-all_methods.pdf')
            fig.savefig(os.getcwd() + f'/figures/comparison_figs/pdf/clusters_{lab}_{org}_{dataset}_res-{res}-all_methods.pdf')


In [None]:
adata_dir = os.getcwd().replace('notebooks', '') + '/data/splatter/'
for deprob in ['0.3', '0.1', '0.05', '0.025', '0.016', '0.01', '0.008', '0.0065']:      
    print(deprob)

    fig, axs = plt.subplots(1, 1)
    FS_names = ['triku', 'scanpy', 'std', 'scry', 'brennecke', 'm3drop', 'nbumi', 'all', 'random']

    colors = ["#E73F74", "#7F3C8D", "#11A579", "#3969AC", "#F2B701", "#80BA5A", "#E68310", "#a0a0a0", "#505050"]
    cutoff, res = 500, 0.6

    rank = f'scatter_{deprob}_feature_ranks.csv'
    adata_name = f'splatter_deprob_{deprob}.loom'

    df_ranks = pd.read_csv(export_dir + rank, index_col=0)
    adata = sc.read_loom(adata_dir + adata_name)
    # Convert the sparse count matrix to a dense ndarray
    adata.X = adata.X.toarray()

    combined_names = np.intersect1d(df_ranks.index.values, adata.var_names)
    # Copy the subset so that the in-place steps below do not operate on a view
    adata = adata[:, combined_names].copy()
    df_ranks = df_ranks.loc[combined_names]
    
    print(adata, adata.X)
    sc.pp.log1p(adata)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=res)

    for col_rest_idx, col_rest in enumerate(FS_names):
        print(col_rest)
        list_mean_exp, list_p_zeros = return_mean_per(adata.X)
        list_genes = adata.var_names

        if col_rest not in ['all', 'random', 'triku']:
            list_genes = df_ranks[df_ranks[col_rest] < cutoff].index
        elif col_rest == 'all':
            list_genes = df_ranks.index
        elif col_rest == 'random':
            # Sample without replacement so the baseline has `cutoff` distinct genes
            list_genes = np.random.choice(df_ranks.index, cutoff, replace=False)
        elif col_rest == 'triku':
            list_genes = df_ranks[df_ranks['triku_0'] < cutoff].index

        list_clust = []
        for gene in list_genes:
            exp_clust = get_norm_exp_cluster(adata, gene)
            list_clust.append(exp_clust)

        arr = np.array(list_clust)

        axs.plot(np.arange(len(exp_clust)), 100 * np.mean(arr, 0), color=colors[col_rest_idx], alpha=1, label=col_rest, 
                 zorder=len(FS_names) - col_rest_idx)

    axs.set_xticks(np.arange(len(exp_clust)))
    axs.set_xticklabels(np.arange(len(exp_clust)))

    axs.set_xlabel('Cluster (most to least expressed)')
    axs.set_ylabel('% of expression')

    # `org` and `dataset` belong to the previous cell; title the plot with the current deprob
    plt.suptitle(f'splatter deprob {deprob}')
    plt.legend()


    print(os.getcwd() + f'/figures/comparison_figs/pdf/clusters_splatter_{deprob}_res-{res}-all_methods.pdf')
    fig.savefig(os.getcwd() + f'/figures/comparison_figs/pdf/clusters_splatter_{deprob}_res-{res}-all_methods.pdf')
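
The gene sets compared in these plots are obtained by thresholding the rank that each method assigns to a gene, plus two baselines: all genes, and a random subset of the same size. The following is a minimal sketch of that selection on a toy rank table, assuming ranks are zero-based so that `rank < cutoff` keeps exactly `cutoff` genes. The gene names and rank values are made up for illustration, and the random baseline is drawn without replacement so that it contains `cutoff` distinct genes.

```python
import numpy as np
import pandas as pd

# Toy rank table (hypothetical values): one column per method, one row per gene
genes = [f'gene_{i}' for i in range(10)]
df_ranks = pd.DataFrame(
    {'triku_0': np.arange(10), 'scanpy': np.arange(10)[::-1]},
    index=genes,
)

cutoff = 3

# Top-ranked genes for a given method: keep the genes whose rank is below the cutoff
top_triku = df_ranks[df_ranks['triku_0'] < cutoff].index
top_scanpy = df_ranks[df_ranks['scanpy'] < cutoff].index

# Random baseline with the same number of distinct genes
rng = np.random.default_rng(0)
random_genes = rng.choice(df_ranks.index.to_numpy(), size=cutoff, replace=False)

print(list(top_triku))
print(list(top_scanpy))
print(len(set(random_genes)))
```

Using a seeded generator makes the random baseline reproducible between runs, which helps when the figures need to be regenerated.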
