__Author:__ Bram Van de Sande

__Date:__ 31 JAN 2018

__Outline:__ In the initial phase of the pySCENIC pipeline, co-expression modules are inferred from single-cell expression profiles.

This notebook uses 3,005 single-cell transcriptomes from the mouse brain (somatosensory cortex and hippocampal regions) as an example.

> A. Zeisel, A. B. Muñoz-Manchado, S. Codeluppi, P. Lönnerberg, G. La Manno, A. Juréus, S. Marques, H. Munguba, L. He, C. Betsholtz, C. Rolny, G. Castelo-Branco, J. Hjerling-Leffler, and S. Linnarsson, “Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq,” Science, vol. 347, no. 6226, pp. 1138–1142, Mar. 2015.

In [1]:
import pandas as pd
import numpy as np
import os

from arboretum.algo import grnboost2
from arboretum.utils import load_tf_names

In [2]:
RESOURCES_FOLDER="/Users/bramvandesande/Projects/lcb/resources"
DATA_FOLDER="/Users/bramvandesande/Projects/lcb/tmp"

## Load the expression matrix

The scRNA-Seq data was downloaded from GEO (accession GSE60361): https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60361

In [3]:
fname = os.path.join(RESOURCES_FOLDER, "GSE60361_C1-3005-Expression.txt")
ex_matrix = pd.read_csv(fname, sep='\t', header=0, index_col=0).T

In [4]:
ex_matrix.head()

cell_id,Tspan12,Tshz1,Fnbp1l,Adamts15,Cldn12,Rxfp1,2310042E22Rik,Sema3c,Jam2,Apbb1ip,...,Gm20826_loc1,Gm20826_loc2,Gm20877_loc2,Gm20877_loc1,Gm20865_loc4,Gm20738_loc4,Gm20738_loc6,Gm21943_loc1,Gm21943_loc3,Gm20738_loc3
1772071015_C02,0,3,3,0,1,0,0,11,1,0,...,0,0,0,0,0,0,0,0,0,0
1772071017_G12,0,1,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1772071017_A05,0,0,6,0,1,0,2,25,1,0,...,0,0,0,0,0,0,0,0,0,0
1772071014_B06,3,2,4,0,0,0,3,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1772067065_H06,0,2,1,0,0,0,0,10,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
ex_matrix.shape

(3005, 19972)
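
The transpose in the loading step matters because the resulting matrix has one row per cell and one column per gene, as the shape above shows. A minimal sketch of that orientation change on toy data follows; the three cells and two genes are made up for illustration and are not taken from the GEO file.

```python
import io

import pandas as pd

# Toy stand-in for the downloaded file: rows are genes, columns are cells.
# The values and cell names below are hypothetical.
toy_tsv = (
    "cell_id\tcell_A\tcell_B\tcell_C\n"
    "Tspan12\t0\t0\t3\n"
    "Tshz1\t3\t1\t2\n"
)

genes_by_cells = pd.read_csv(io.StringIO(toy_tsv), sep="\t", header=0, index_col=0)

# Transpose so that each row is a cell and each column is a gene.
toy_matrix = genes_by_cells.T

print(toy_matrix.shape)
print(toy_matrix)
```

If the shape printed for the real data had the two numbers swapped, the transpose was probably skipped.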

## Derive list of Transcription Factors (TFs) for _Mus musculus_

A list of known TFs for _Mus musculus_ (Mm) was prepared beforehand (cf. notebook).

In [17]:
tf_names = load_tf_names(os.path.join(RESOURCES_FOLDER, 'mm_tfs.txt'))
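
As a rough illustration of what a TF list looks like once loaded, the sketch below assumes a plain-text file with one gene symbol per line. That file layout is an assumption, `read_names` is a hypothetical helper rather than the arboretum function, and the gene symbols are toy values. It also shows how to check which TFs actually occur among the genes of an expression matrix.

```python
import os
import tempfile


def read_names(path):
    """Hypothetical helper: read one name per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a toy TF file (assumed format: one gene symbol per line).
    toy_path = os.path.join(tmp_dir, "toy_tfs.txt")
    with open(toy_path, "w") as handle:
        handle.write("Olig1\nNeurod6\n\nNotInMatrix\n")
    toy_tf_names = read_names(toy_path)

# Keep only the TFs that are present among the (toy) gene columns.
toy_genes = {"Olig1", "Cnp", "Neurod6", "Hpca"}
tfs_in_matrix = [tf for tf in toy_tf_names if tf in toy_genes]

print(toy_tf_names)
print(tfs_in_matrix)
```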

## Run GRNBoost to infer co-expression modules

The arboretum package is used for this phase of the pipeline. For this notebook, only a sample of 1,000 cells is used for the co-expression module inference.

In [25]:
N_SAMPLES = 1000
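
Because `DataFrame.sample` draws rows at random, each run of the notebook selects a different subset of cells. One way to make the subsampling repeatable is to pass `random_state`. This is a sketch on toy data; the seed value and the toy matrix are arbitrary choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy cells-by-genes matrix (hypothetical values).
toy_matrix = pd.DataFrame(
    np.arange(20).reshape(10, 2),
    index=[f"cell_{i}" for i in range(10)],
    columns=["GeneA", "GeneB"],
)

N_TOY_SAMPLES = 4
SEED = 42  # arbitrary seed, chosen only for illustration

# Sampling without replacement; the same seed yields the same cells each time.
subset_1 = toy_matrix.sample(n=N_TOY_SAMPLES, random_state=SEED)
subset_2 = toy_matrix.sample(n=N_TOY_SAMPLES, random_state=SEED)

print(subset_1.index.equals(subset_2.index))
```

Fixing the seed makes the set of sampled cells identical between runs.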

In [18]:
network = grnboost2(expression_data=ex_matrix.sample(n=N_SAMPLES),
                    tf_names=tf_names, verbose=True)

preparing dask client
parsing input
creating dask graph
computing dask graph
shutting down client and local cluster
finished


In [19]:
network.head()

Unnamed: 0,TF,target,importance
35,Rpl7,Rpl34-ps1,86.11987
172,Olig1,Cnp,70.127927
172,Olig1,Tspan2,69.595029
155,Neurod6,Hpca,68.236759
172,Olig1,Cers2,67.195168
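
The result is a table of (TF, target, importance) links. One simple way to explore such a table is to keep the highest-importance targets per TF, sketched below on toy data. The values and the cutoff `TOP_N` are hypothetical, and this is not necessarily how pySCENIC derives its modules in later steps.

```python
import pandas as pd

# Toy network with the same three columns as the output above.
toy_network = pd.DataFrame(
    {
        "TF": ["Olig1", "Olig1", "Olig1", "Neurod6", "Neurod6"],
        "target": ["Cnp", "Tspan2", "Cers2", "Hpca", "Cnp"],
        "importance": [70.1, 69.6, 67.2, 68.2, 1.5],
    }
)

TOP_N = 2  # hypothetical cutoff, for illustration only

# Sort links by importance, then keep the first TOP_N rows of each TF group.
top_targets = (
    toy_network.sort_values("importance", ascending=False)
    .groupby("TF", sort=False)
    .head(TOP_N)
    .reset_index(drop=True)
)

print(top_targets)
```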


In [27]:
len(network)

4109019

In [28]:
network.to_csv(os.path.join(DATA_FOLDER, "coexpression-modules.tsv"), index=False, sep='\t')
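
A later notebook can read the saved table back with `pd.read_csv` using the same tab separator. The round trip is sketched here on toy data, with a temporary directory standing in for `DATA_FOLDER` and a made-up file name.

```python
import os
import tempfile

import pandas as pd

# Toy network (hypothetical values) with the same columns as the real output.
toy_links = pd.DataFrame(
    {
        "TF": ["Olig1", "Neurod6"],
        "target": ["Cnp", "Hpca"],
        "importance": [70.127927, 68.236759],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy-coexpression-modules.tsv")
    # index=False keeps the row index out of the file, so no extra column appears on reload.
    toy_links.to_csv(out_path, index=False, sep="\t")
    reloaded = pd.read_csv(out_path, sep="\t")

print(list(reloaded.columns))
```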