### Overview
This notebook converts an scRNA-seq dataset (one that is not used in any prior step) into pseudo-bulk RNA-seq by summing gene expression across all cells that share a sample ID. An alternative that aggregates by mean is also provided in the code.

The pseudo-bulk RNA-seq matrix is exported as a tab-delimited text file suitable for use in CIBERSORTx [3].

**This notebook is written in Python.**

In [1]:
#Libraries and global settings
import scanpy as sc
import pandas as pd

%matplotlib inline 

#### Import scRNA-seq 

In [2]:
#Import scRNA-seq (AnnData .h5ad file)
path = "../../data/demo_public/input/AllenBrain_for_bulk.h5ad"
adata_bulk = sc.read(path, cache=True)

#### Convert scRNA-seq to pseudo-bulk RNA-seq

In [3]:
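#Get the expression matrix (cells x genes), the gene names and the sample ID of each cell
#Note: if adata_bulk.X is a SciPy sparse matrix, it may need converting to dense (.toarray()) first
d_adata_bulk_exprs = adata_bulk.X
l_adata_bulk_genes = adata_bulk.var_names.tolist()
l_adata_bulk_sampleid = adata_bulk.obs["Allen_sampleID"].tolist()

#Create a cells x genes dataframe indexed by sample ID, then transpose it to genes x cells
df_exprs_sampleid = pd.DataFrame(d_adata_bulk_exprs, index=l_adata_bulk_sampleid, columns=l_adata_bulk_genes)
df_exprs_sampleid_t = df_exprs_sampleid.transpose()
df_exprs_sampleid_t.head()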

Gene,446701,410107,446701,446701,446701,410108,410108,446701,410107,446701,...,446701,410107,410107,410107,446701,410107,410107,446701,410107,410107
Xkr4,7.0,3.0,3.0,13.0,15.0,14.0,24.0,4.0,2.0,12.0,...,5.0,3.0,8.0,4.0,9.0,0.0,16.0,12.0,6.0,0.0
Gm1992,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Gm37381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Rp1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Sox17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
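Because the aggregated sum scales with the number of cells per sample, it can be useful to count how many columns carry each sample ID before collapsing them. The following is a minimal sketch on a toy matrix; the gene names, sample labels (`S1`, `S2`, `S3`) and values are hypothetical and only mimic the genes x cells layout shown above.

```python
import pandas as pd

# Toy genes x cells matrix (hypothetical values). Column labels are sample IDs,
# repeated once per cell, as in the transposed dataframe above.
toy_exprs = pd.DataFrame(
    [[7.0, 3.0, 3.0, 14.0], [0.0, 1.0, 0.0, 0.0]],
    index=["GeneA", "GeneB"],
    columns=["S1", "S2", "S1", "S3"],
)

# Count the number of cells (columns) that belong to each sample ID.
cells_per_sample = pd.Series(toy_exprs.columns).value_counts().sort_index()
print(cells_per_sample)
```

Samples with many more cells than others will dominate a sum-based pseudo-bulk, which is one reason to compare against the mean-based alternative.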


In [4]:
#Collapse columns that share the same name (sample ID). Refer to [4] for guidance on choosing an aggregation method.
#Grouping is done on the transposed frame because groupby(..., axis=1) is deprecated since pandas 2.1.

df_exprs_sampleid_t = df_exprs_sampleid_t.T.groupby(level=0).sum().T  #aggregation by sum
#Comment out the line above and uncomment the line below to aggregate by mean instead
#df_exprs_sampleid_t = df_exprs_sampleid_t.T.groupby(level=0).mean().T  #aggregation by mean

df_exprs_sampleid_t.head()

Gene,410107,410108,446701
Xkr4,10713.0,12212.0,24810.0
Gm1992,305.0,353.0,418.0
Gm37381,4.0,6.0,6.0
Rp1,22.0,22.0,39.0
Sox17,2.0,3.0,19.0
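The relationship between the two aggregation options can be illustrated on a small example. This sketch uses hypothetical toy data (`toy_exprs`, sample labels `S1` to `S3`) and groups on the transposed frame, which avoids the `axis=1` argument of `groupby`; it is an illustration of the idea, not the notebook's exact cell.

```python
import pandas as pd

# Toy genes x cells matrix with repeated sample IDs as column labels (hypothetical values).
toy_exprs = pd.DataFrame(
    [[7.0, 3.0, 3.0, 14.0], [0.0, 1.0, 0.0, 0.0]],
    index=["GeneA", "GeneB"],
    columns=["S1", "S2", "S1", "S3"],
)

# Transpose so cells are rows, group rows by sample ID, aggregate, transpose back.
grouped = toy_exprs.T.groupby(level=0)
pseudobulk_sum = grouped.sum().T    # total counts per gene and sample
pseudobulk_mean = grouped.mean().T  # average counts per gene and sample
n_cells = grouped.size()            # number of cells per sample

print(pseudobulk_sum)
print(pseudobulk_mean)
```

Summation preserves each gene's total count across the dataset, while the mean equals the sum divided by the number of cells in the sample, so the two differ only by a per-sample scaling factor.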


#### Export data

In [5]:
#Write data
path = "../../data/demo_public/output/cibersortx_pseudobulk.txt"
df_exprs_sampleid_t.to_csv(path, sep='\t', chunksize=500)
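A quick round trip can confirm that the exported table has the intended shape. The sketch below writes a toy pseudo-bulk matrix to an in-memory buffer instead of a file and reads it back. The `index_label="Gene"` argument is an assumption added here so that the first column gets a header; the original cell does not set it, and the toy values and sample labels are hypothetical.

```python
import io

import pandas as pd

# Toy pseudo-bulk matrix: genes as rows, samples as columns (hypothetical values).
toy_pseudobulk = pd.DataFrame(
    {"S1": [10.0, 0.0], "S2": [3.0, 1.0]},
    index=["GeneA", "GeneB"],
)

# Write tab-delimited text to an in-memory buffer; index_label names the gene column.
buffer = io.StringIO()
toy_pseudobulk.to_csv(buffer, sep="\t", index_label="Gene")
exported_text = buffer.getvalue()
print(exported_text)

# Read the text back to verify that genes and samples survive the round trip.
reloaded = pd.read_csv(io.StringIO(exported_text), sep="\t", index_col=0)
```

Without `index_label`, pandas leaves the first header cell empty, which some downstream tools may or may not accept.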

##### References
1. Chia, C. M., Roig Adam, A., & Moro, A. (2022). *In silico* multiple single-subject neural tissue screening using deconvolution on pseudo-bulk RNA-seq - a prototype. Bioinformatics and Systems Biology joint degree program. Vrije Universiteit Amsterdam and University of Amsterdam. 

2. Allen Institute for Brain Science (2004). Allen Mouse Brain Atlas, Mouse Whole Cortex and Hippocampus 10x. Available from mouse.brain-map.org. Allen Institute for Brain Science (2011).

3. Newman, A. M., Liu, C. L., Green, M. R., Gentles, A. J., Feng, W., Xu, Y., Hoang, C. D., Diehn, M., & Alizadeh, A. A. (2015). Robust enumeration of cell subsets from tissue expression profiles. Nature Methods, 12(5), 453–457. https://doi.org/10.1038/nmeth.3337

4. Junttila, S., Smolander, J., & Elo, L. L. (2022). Benchmarking methods for detecting differential states between conditions from multi-subject single-cell RNA-seq data. Briefings in Bioinformatics, 23(5), bbac286. https://doi.org/10.1093/bib/bbac286