# Documenting how to compute mean for RNAseq datasets

## Mean
- There are two ways (that I am aware of) to compute the mean expression per cell type
    - Script from Rachel (the original script is `single_cell_MAGMA_files_v2.py`, which I have modified into `compute_sumstat_magma.py`). For convenience, I will call this method `mean_vanilla`
    - EWCE: the method used in the Skene papers. I will call this `mean_ewce`
- Below I will attempt to demonstrate how to calculate each of these

### mean_vanilla
- First, normalize counts per cell 
- Second, log-transform
- Third, compute mean
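
Before the real dataset, the three steps above can be sketched on a toy matrix. This is only an illustration with made-up counts and hypothetical cell-type labels (`A`, `B`); it uses plain NumPy instead of scanpy and assumes the same settings as the walkthrough (a target sum of 1e6 and a base-2 log of 1 + x).

```python
import numpy as np

# Toy raw counts: 4 cells (rows) x 3 genes (columns); values are made up.
counts = np.array([
    [10, 0, 90],
    [5, 5, 0],
    [0, 20, 20],
    [1, 1, 2],
], dtype=float)
cell_types = np.array(["A", "A", "B", "B"])  # hypothetical cell-type labels

# 1) normalize each cell so that its counts sum to 1e6 (counts per million)
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e6

# 2) log-transform: log2(1 + x)
log_cpm = np.log2(1.0 + cpm)

# 3) mean over the cells of each cell type, gene by gene
means = {ct: log_cpm[cell_types == ct].mean(axis=0) for ct in np.unique(cell_types)}
```

Here `means["A"]` holds, for each gene, the average log2 CPM over the two cells labelled `A`.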

- In the following code I will walk through an example.

In [1]:
import scanpy as sc
import anndata
import pandas as pd

- First, read in and clean the h5ad file

In [2]:
adata_path = "scrna/41_Siletti_Cerebellum.CBV_Human_2022/41_Siletti_Cerebellum.CBV_Human_2022.h5ad" #this is the h5ad file after preprocessing
adata_original = anndata.read_h5ad(adata_path)
print(f"Originally, adata has {adata_original.shape[0]} rows (cells) and {adata_original.shape[1]} columns (genes)")

# some genes in the scRNAseq data have no Ensembl ID, so it is cleaner to create a filtered adata
# that drops the genes that could not be converted to Ensembl
adata_original.var_names_make_unique()
gene_list = adata_original.var[adata_original.var["ensembl"].notnull()].index.values.tolist() # gene symbols that have an Ensembl conversion
adata = adata_original[:, gene_list].copy() # keep only the genes that were converted to Ensembl successfully

# now, adata.X is the expression matrix (sparse here) where each row is a cell and each column is a gene
print(f"After cleaning, adata has {adata.shape[0]} rows (cells) and {adata.shape[1]} columns (genes)")



Originally, adata has 71852 rows (cells) and 36515 columns (genes)
After cleaning, adata has 71852 rows (cells) and 24617 columns (genes)
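
The gene filter used in the cell above can be seen in isolation on a small table. This is a sketch only: `var` is a hypothetical stand-in for `adata_original.var`, and the symbols and IDs are placeholders, not real genes.

```python
import pandas as pd

# Toy stand-in for adata_original.var: the index holds gene symbols and the
# "ensembl" column holds the converted ID (None when the conversion failed).
var = pd.DataFrame(
    {"ensembl": ["ENSG_A", None, "ENSG_C"]},  # placeholder IDs, not real Ensembl IDs
    index=["GENE1", "GENE2", "GENE3"],
)

# True for genes that have an Ensembl ID
keep = var["ensembl"].notnull()

# Symbols of the genes to keep, in their original order
gene_list = var.index[keep].tolist()
```

In this toy case `GENE2` is dropped because it has no Ensembl ID, which mirrors the drop from 36515 to 24617 genes in the real dataset.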


- Second, normalize and log-transform

In [3]:
# normalize counts per cell (in place)
sc.pp.normalize_total(adata, target_sum=1e6)

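# log-transform the normalized data: log2(1 + x)
# note: this call passes the matrix itself and relies on scanpy modifying it in place (the returned matrix is shown below);
# the more usual form is sc.pp.log1p(adata, base=2)
sc.pp.log1p(adata.X, base=2)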

<71852x24617 sparse matrix of type '<class 'numpy.float32'>'
	with 174317471 stored elements in Compressed Sparse Row format>

- Third, compute mean

In [4]:
ct_colname = "supercluster_term"
# define the genes in the dataset
genes = adata.var["ensembl"].to_list()

# define the available cell types in the dataset
cts = adata.obs[ct_colname].dropna().unique()

means_cell_log_counts_pM = pd.DataFrame(data=None, index=cts, columns=genes, dtype=float)

for ct in cts:
    Y = adata[adata.obs[ct_colname] == ct, :].to_df()
    Y.columns = genes
    means_cell_log_counts_pM.loc[ct, :] = Y.mean(0)

# add an "Average" row (mean across cell types) and replace spaces, slashes and colons in the cell type names
means_cell_log_counts_pM.loc["Average"] = (means_cell_log_counts_pM.mean(axis=0))
means_cell_log_counts_pM.index = [w.replace(' ', '_') for w in means_cell_log_counts_pM.index.values]
means_cell_log_counts_pM.index = [w.replace('/', '_') for w in means_cell_log_counts_pM.index.values]
means_cell_log_counts_pM.index = [w.replace(':', '_') for w in means_cell_log_counts_pM.index.values]

means_cell_log_counts_pM_t = means_cell_log_counts_pM.T
means_cell_log_counts_pM_t.index.name = "GENE"
means_cell_log_counts_pM_t.reset_index(inplace=True)
means_cell_log_counts_pM_out = means_cell_log_counts_pM_t.drop_duplicates(subset=["GENE"])
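
As a cross-check of the loop above, the same per-cell-type means can be obtained with a pandas `groupby`. The sketch below runs on a made-up table; `expr`, `labels`, the gene IDs and the second cell type name are hypothetical, and this is an alternative formulation rather than what `compute_sumstat_magma.py` does.

```python
import re

import pandas as pd

# Toy log-normalized expression: rows are cells, columns are genes (made-up values).
expr = pd.DataFrame(
    {"ENSG_A": [1.0, 3.0, 0.0, 2.0], "ENSG_B": [0.0, 0.0, 4.0, 6.0]},
    index=["cell1", "cell2", "cell3", "cell4"],
)
# Hypothetical cell-type label for each cell, aligned on the cell index.
labels = pd.Series(
    ["Upper rhombic lip", "Upper rhombic lip", "Type/X:1", "Type/X:1"],
    index=expr.index,
)

# Mean per cell type (one row per cell type, one column per gene)
means = expr.groupby(labels).mean()

# "Average" is the unweighted mean of the cell-type means, as in the loop version
means.loc["Average"] = means.mean(axis=0)

# Replace spaces, slashes and colons in the cell type names in one pass
means.index = [re.sub(r"[ /:]", "_", name) for name in means.index]
```

If the label column is categorical, as columns in `adata.obs` often are, passing `observed=True` to `groupby` may be needed to skip categories with no cells.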

In [5]:
print(means_cell_log_counts_pM_out.head())

              GENE  Committed_oligodendrocyte_precursor  Oligodendrocyte   
0  ENSG00000284678                             0.245080         0.025246  \
1  ENSG00000167995                             2.768120         1.567038   
2  ENSG00000204655                             4.400767         8.467441   
3  ENSG00000253807                             8.470038         5.498433   
4  ENSG00000169247                             0.553338         5.730296   

   Oligodendrocyte_precursor  Splatter  Upper_rhombic_lip   
0                   0.017279       0.0           0.007333  \
1                   0.707532       0.0           0.035181   
2                   0.287125       0.0           0.060208   
3                   2.357371       0.0           0.090106   
4                   0.036469       0.0           0.019087   

   Cerebellar_inhibitory  Miscellaneous  Astrocyte  Bergmann_glia  Ependymal   
0               0.021989       0.000000   0.000000       0.000000        0.0  \
1               

- **NOTES**: For some of the scRNAseq datasets, `mean_vanilla` was already computed and is stored on Snellius: `/gpfs/work5/0/vusr0480/Processed_scRNA/data/magma/`

### mean_ewce
- Running EWCE gives both the specificity and the mean. Therefore, please refer to the file `compute_spec.ipynb` for more information.