# Using `memento` to analyze Interferon-B response in monocytes

To install the pre-release version of `memento` (for Ye Lab members), install it directly from GitHub by running:

```pip install git+https://github.com/yelabucsf/scrna-parameter-estimation.git@release-v0.0.2```

This requires that you have access to the Ye Lab organization. 

In [1]:
import sys
sys.path.append('/data/home/Github/scrna-parameter-estimation/dist/memento-0.0.1-py3.7.egg')
import memento


In [2]:
import scanpy as sc
import memento

In [3]:
fig_path = '/data/home/Github/scrna-parameter-estimation/figures/fig4/'
data_path = '/data_volume/parameter_estimation/'

### Read IFN data and filter for monocytes

For `memento`, we need the raw count matrix. Preferably, pass in the matrix with all genes so that we can choose which genes to look at later.

One of the columns in `adata.obs` should be the discrete groups to compare mean, variability, and co-variability across. In this case, it's called `stim`. 

The column containing the covariate that you want p-values for should either:
- Be binary (i.e., the column contains only two unique values, such as 'A' and 'B'). Here, the values are either 'stim' or 'ctrl'.
- Be numeric (e.g., the column contains -1, 0, or 1 for each genotype value).

I recommend changing the labels to something numeric (here, I use 0 for `ctrl` and 1 for `stim`). Otherwise, the sign of the DE/EV/DC testing will be very hard to interpret.
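
A minimal sketch of that recoding on a toy `obs` table (the table and its rows are made up for illustration; only the `stim` column name and the `ctrl`/`stim` labels come from this notebook). An explicit mapping makes the direction of the contrast visible, and an unexpected label shows up as a missing value instead of being silently coded as 1:

```python
import pandas as pd

# Toy stand-in for adata.obs; the rows are illustrative only.
obs = pd.DataFrame({'stim': ['ctrl', 'stim', 'stim', 'ctrl']})

# Explicit mapping: 0 = control, 1 = stimulated, so a positive
# coefficient means "higher in stimulated cells".
label_to_code = {'ctrl': 0, 'stim': 1}
coded = obs['stim'].map(label_to_code)

# Any label missing from the mapping becomes NaN; fail early if so.
assert coded.notna().all(), 'unexpected label in stim column'

obs['stim'] = coded.astype(int)
print(obs['stim'].tolist())
```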

In [4]:
adata = sc.read(data_path + 'interferon_filtered.h5ad')
adata = adata[adata.obs.cell == 'CD14+ Monocytes'].copy()
print(adata)

AnnData object with n_obs × n_vars = 5341 × 35635
    obs: 'tsne1', 'tsne2', 'ind', 'stim', 'cluster', 'cell', 'multiplets', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'cell_type'
    var: 'gene_ids', 'mt', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'
    uns: 'cell_type_colors'
    obsm: 'X_tsne'



In [5]:
adata.obs['stim'] = adata.obs['stim'].apply(lambda x: 0 if x == 'ctrl' else 1)

In [6]:
adata.obs[['ind', 'stim', 'cell']].sample(5)

| index | ind | stim | cell |
|---|---|---|---|
| ACAACCGAAGCATC-1 | 1256 | 0 | CD14+ Monocytes |
| GATCTTACATCACG-1 | 1256 | 0 | CD14+ Monocytes |
| GCCCATACGAATGA-1 | 1256 | 0 | CD14+ Monocytes |
| CAGTTACTTGAACC-1 | 101 | 0 | CD14+ Monocytes |
| CCTGACTGGTCACA-1 | 101 | 0 | CD14+ Monocytes |


### Create groups for hypothesis testing and compute 1D parameters

`memento` groups cells by whichever `adata.obs` columns define a meaningful group. Here, we just divide the cells into `stim` and `ctrl`, but we can further split them by individual by adding the `ind` column to the `label_columns` argument of `create_groups`.
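
To preview which groups a given choice of label columns would produce, you can count cells per label combination with plain pandas. This is only a sketch on a made-up `obs` table; it does not call `memento`, and how `create_groups` partitions cells internally is an assumption here.

```python
import pandas as pd

# Toy stand-in for adata.obs; values are illustrative only.
obs = pd.DataFrame({
    'stim': [0, 0, 1, 1, 1, 0],
    'ind':  [101, 1256, 101, 101, 1256, 101],
})

# One group per condition.
print(obs.groupby(['stim']).size())

# One group per (condition, individual) pair.
group_sizes = obs.groupby(['stim', 'ind']).size()
print(group_sizes)
```

Splitting by individual yields more groups with fewer cells each, so it is worth checking the group sizes before committing to a finer grouping.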

`q` is a rough estimate of the overall UMI efficiency across both sampling and sequencing. If `s` is the sequencing saturation, set `q` to `s` multiplied by 0.07 for 10X v1, 0.15 for v2, or 0.25 for v3.
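
As a worked example of that arithmetic, the sketch below multiplies a saturation value by the chemistry-specific factor. The helper name `estimate_q` and the saturation of 0.8 are hypothetical and chosen only for illustration; the three multipliers are the ones quoted above.

```python
# Multipliers for 10X chemistries, as quoted in the text.
CAPTURE_EFFICIENCY = {'v1': 0.07, 'v2': 0.15, 'v3': 0.25}

def estimate_q(saturation, chemistry):
    """Rough overall UMI efficiency: sequencing saturation times the
    chemistry-specific multiplier. Hypothetical helper, not part of memento."""
    if not 0.0 < saturation <= 1.0:
        raise ValueError('saturation must be a fraction in (0, 1]')
    return saturation * CAPTURE_EFFICIENCY[chemistry]

# Made-up example: 80% saturation on v2 chemistry.
q = estimate_q(saturation=0.8, chemistry='v2')
print(round(q, 3))
```

The call in this notebook passes `q=0.07`, which equals the v1 multiplier with `s = 1`.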

By default, `memento` will consider all genes whose expression is high enough to calculate an accurate variance. If you wish to include fewer genes, increase `filter_mean_thresh`.
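
The filtering rule can be pictured with a small NumPy sketch. It assumes a gene is kept when its raw mean exceeds `filter_mean_thresh` in at least a `min_perc_group` fraction of groups; that reading comes from the argument comments in this notebook, not from `memento`'s source, and the counts are made up.

```python
import numpy as np

# Toy raw counts: rows are cells, columns are 3 hypothetical genes.
counts = np.array([
    [0, 2, 0],
    [1, 3, 0],
    [0, 1, 0],
    [0, 4, 1],
])
# Group label per cell (e.g. 0 = ctrl, 1 = stim).
groups = np.array([0, 0, 1, 1])

filter_mean_thresh = 0.07
min_perc_group = 0.9

# Per-group mean of each gene: shape (n_groups, n_genes).
group_means = np.vstack([counts[groups == g].mean(axis=0)
                         for g in np.unique(groups)])

# Fraction of groups in which each gene clears the threshold.
frac_passing = (group_means > filter_mean_thresh).mean(axis=0)

# Keep genes that pass in enough groups.
keep = frac_passing >= min_perc_group
print(keep)
```

In this toy case only the second gene is expressed in both groups, so it is the only one kept.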

In [7]:
memento.create_groups(adata, label_columns=['stim'], inplace=True, q=0.07)

In [8]:
memento.compute_size_factors(adata)

In [9]:
memento.compute_1d_moments(
    adata, 
    inplace=True, 
    filter_mean_thresh=0.07,  # minimum raw mean of a gene within a group for the gene to be considered
    min_perc_group=.9)  # fraction of groups that must satisfy that condition for the gene to be considered


### Perform 1D hypothesis testing

`formula_like` determines the linear model that is used for hypothesis testing, while `cov_column` is used to pick out the variable that you actually want p-values for. 
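
For intuition about what the `1 + stim` formula encodes, here is an ordinary least-squares sketch with NumPy on made-up values. It is not `memento`'s estimator or its bootstrap test; it only shows the design matrix implied by an intercept plus a 0/1 `stim` column, and that the coefficient on `stim` is the stim-minus-ctrl difference.

```python
import numpy as np

# Made-up response values (e.g. one estimate per group).
y = np.array([1.0, 1.2, 3.1, 2.9])
stim = np.array([0, 0, 1, 1])

# Design matrix for '1 + stim': an intercept column and the covariate.
X = np.column_stack([np.ones_like(stim, dtype=float), stim])

# Least-squares fit; coef[0] is the intercept, coef[1] the stim effect.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coef)
```

With 0/1 coding, the intercept is the control mean and the second coefficient is the shift under stimulation, which is the term a `cov_column` of `stim` would single out under this reading.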

`num_cpus` controls how many CPUs this operation is parallelized across. In general, I recommend using 3-6 CPUs for reasonable performance on any of the AWS machines that we have access to (I'm currently using a c5.2xlarge instance with 8 vCPUs).
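
One way to pick a value on an unfamiliar machine is sketched below. Leaving two cores free is an assumption of this sketch, not a `memento` requirement; the cap of 6 is the upper end of the range suggested above.

```python
import os

# os.cpu_count() can return None when the count is undetermined.
available = os.cpu_count() or 1

# Leave two cores free for the rest of the system, cap at 6,
# and never go below one worker.
num_cpus = max(1, min(6, available - 2))
print(num_cpus)
```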

In [10]:
memento.ht_1d_moments(
    adata, 
    formula_like='1 + stim',
    cov_column='stim', 
    num_boot=5000, 
    verbose=1,
    num_cpus=6)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  38 tasks      | elapsed:    7.9s
[Parallel(n_jobs=6)]: Done 188 tasks      | elapsed:   34.0s
[Parallel(n_jobs=6)]: Done 438 tasks      | elapsed:  1.3min
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:  2.3min
[Parallel(n_jobs=6)]: Done 1238 tasks      | elapsed:  3.6min
[Parallel(n_jobs=6)]: Done 1788 tasks      | elapsed:  5.3min
[Parallel(n_jobs=6)]: Done 1877 out of 1877 | elapsed:  5.5min finished


In [11]:
result_1d = memento.get_1d_ht_result(adata)

In [12]:
result_1d.query('de_coef > 0').sort_values('de_pval').head(10)

| | gene | de_coef | de_pval | dv_coef | dv_pval |
|---|---|---|---|---|---|
| 1831 | APOBEC3A | 3.577028 | 1.498803e-21 | -2.13362 | 2.269843e-17 |
| 646 | PSMB9 | 1.322325 | 9.600557e-13 | -1.146485 | 0.0008945763 |
| 690 | FAM26F | 3.258483 | 7.978794e-10 | -0.700244 | 0.0084 |
| 1411 | PSMA4 | 1.164044 | 1.703743e-09 | -0.071111 | 0.6426 |
| 1729 | BST2 | 1.615706 | 6.621337e-09 | -1.129334 | 9.428445e-10 |
| 1041 | IRF7 | 2.078901 | 7.703532e-09 | -0.570913 | 3.096727e-07 |
| 647 | TAP1 | 1.210972 | 2.284785e-08 | -0.229118 | 0.121 |
| 811 | SAT1 | 1.162465 | 3.363499e-08 | 0.423172 | 0.0064 |
| 1594 | MYL12A | 1.143497 | 3.451711e-08 | -0.343865 | 0.0168 |
| 1869 | MX1 | 3.608483 | 3.757651e-08 | -1.447122 | 1.440663e-09 |


In [13]:
result_1d.sort_values('dv_pval').head(10)

| | gene | de_coef | de_pval | dv_coef | dv_pval |
|---|---|---|---|---|---|
| 1039 | IFITM3 | 3.393203 | 1.609065e-07 | -3.238808 | 9.362033e-49 |
| 1527 | CCL2 | 1.477253 | 4.603379e-07 | -1.726231 | 1.072001e-38 |
| 1421 | ISG20 | 3.646966 | 7.05045e-05 | -2.904924 | 2.461152e-37 |
| 37 | IFI6 | 2.740365 | 4.441463e-05 | -2.210914 | 3.676804e-28 |
| 915 | LY6E | 3.432749 | 4.415746e-06 | -3.242748 | 8.332773e-27 |
| 876 | IDO1 | 3.932135 | 1.085957e-06 | -2.093306 | 5.154047e-23 |
| 1300 | PSME2 | 0.816086 | 2.851907e-06 | -1.008491 | 1.684524e-22 |
| 1376 | B2M | 0.301481 | 0.0001179484 | -0.684679 | 1.826482e-22 |
| 0 | ISG15 | 4.630542 | 3.331181e-07 | -3.849747 | 1.636284e-20 |
| 1528 | CCL7 | 2.127498 | 6.045517e-06 | -1.119085 | 2.443399e-19 |


### Perform 2D hypothesis testing

For differential coexpression testing, we can specify which gene pairs to test. `compute_2d_moments` takes a list of gene pairs, where each element of the list is a tuple. Here, we focus on one transcription factor (IRF7) and its correlations with the rest of the transcriptome.

Similar to the 1D case, 2D hypothesis testing scales with the number of pairs of genes to test. If you have a smaller set of candidate genes, it will run faster.
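
A minimal sketch of building a smaller pair list from a candidate set. The candidate genes below are picked arbitrarily for illustration, and dropping self-pairs is a choice made in this sketch, not a documented `memento` requirement.

```python
import itertools

# Hypothetical inputs: one transcription factor and a short candidate list.
tfs = ['IRF7']
candidates = ['IRF7', 'CD74', 'ACTB', 'MX1']

# Pair every TF with every candidate, dropping self-pairs.
gene_pairs = [(tf, gene)
              for tf, gene in itertools.product(tfs, candidates)
              if tf != gene]
print(gene_pairs)
```

The resulting list of tuples has the same shape as the `gene_pairs` built in the cells below, just with fewer entries.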

In [18]:
import itertools

In [19]:
gene_pairs = list(itertools.product(['IRF7'], adata.var.index.tolist()))

In [20]:
memento.compute_2d_moments(adata, gene_pairs)

In [31]:
memento.ht_2d_moments(
    adata, 
    formula_like='1 + stim', 
    cov_column='stim', 
    num_cpus=6, 
    num_boot=5000)

[Parallel(n_jobs=6)]: Using backend LokyBackend with 6 concurrent workers.
[Parallel(n_jobs=6)]: Done  20 tasks      | elapsed:    2.8s
[Parallel(n_jobs=6)]: Done 116 tasks      | elapsed:   14.8s
[Parallel(n_jobs=6)]: Done 276 tasks      | elapsed:   35.9s
[Parallel(n_jobs=6)]: Done 500 tasks      | elapsed:  1.1min
[Parallel(n_jobs=6)]: Done 788 tasks      | elapsed:  1.7min
[Parallel(n_jobs=6)]: Done 1140 tasks      | elapsed:  2.5min
[Parallel(n_jobs=6)]: Done 1556 tasks      | elapsed:  3.4min
[Parallel(n_jobs=6)]: Done 1876 out of 1876 | elapsed:  4.1min finished


In [32]:
result_2d = memento.get_2d_ht_result(adata)

In [33]:
result_2d.sort_values('corr_pval').head(10)

| | gene_1 | gene_2 | corr_coef | corr_pval | corr_fdr |
|---|---|---|---|---|---|
| 574 | IRF7 | CD74 | 0.316293 | 0.000123 | 0.073478 |
| 1815 | IRF7 | SDF2L1 | 0.304159 | 0.000181 | 0.073478 |
| 104 | IRF7 | GCLM | 0.396597 | 0.000283 | 0.073478 |
| 716 | IRF7 | ACTB | 0.272642 | 0.000334 | 0.073478 |
| 638 | IRF7 | HLA-DRA | 0.252754 | 0.000473 | 0.073478 |
| 158 | IRF7 | LMNA | 0.306795 | 0.00056 | 0.073478 |
| 1108 | IRF7 | MALAT1 | 0.249095 | 0.000571 | 0.073478 |
| 493 | IRF7 | ANXA5 | 0.275887 | 0.000642 | 0.073478 |
| 211 | IRF7 | GPR137B | 0.397522 | 0.000659 | 0.073478 |
| 626 | IRF7 | HLA-C | 0.268576 | 0.000686 | 0.073478 |
