### Goal
Preprocessing RNA-seq data

This tutorial walks through the main workflow for preparing a training dataset for a basic linear regression model.
The output (X, Y) contains all pairwise combinations of tissues within each sample (the `Individual` column of the annotation), where X is the input and Y is the target.

In [27]:
# code autoreload
%load_ext autoreload
%autoreload 2

In [7]:
!pip install numpy pandas anndata scanpy argh



In [33]:
import rep.preprocessing as p
from rep.constants import ANNDATA_CST as a

%aimport

Modules to reload:
all-except-skipped

Modules to skip:



#### 1. Load raw count matrix + annotation and put it into an anndata object

In [9]:
%%writefile data.csv
T1_s1,T2_s2,T3_s1,T4_s2,T5_s1,T6_s1
G1,10,20,30,40,5,6
G2,5,10,20,30,5,6
G3,6,7,8,9,1,1

Writing data.csv


In [60]:
%%writefile anno_var.csv
Sample,Gender,Tissue,Individual,Technology
T1_s1,F,Tissue1,Sample1,rnaseq
T2_s2,M,Tissue2,Sample2,rnaseq
T3_s1,F,Tissue3,Sample1,rnaseq
T4_s2,M,Tissue4,Sample2,rnaseq
T5_s1,M,Tissue5,Sample1,rnaseq
T6_s1,M,Tissue6,Sample1,rnaseq

Overwriting anno_var.csv


In [11]:
%%writefile anno_obs.csv
G1,hg19,T1,chr1,1111,-
G2,hg19,T2,chr2,2222,-
G3,hg19,T3,chr3,3333,-

Writing anno_obs.csv


In [63]:
# annobj is the equivalent of a SummarizedExperiment object in R.
# It contains:
# - annobj.X   - the count matrix
# - annobj.var - DataFrame annotating the columns of the count matrix (in our case Tissue_Sample)
# - annobj.obs - DataFrame annotating the rows of the count matrix (the features we measure, in our case genes)
annobj = p.load_count_matrix("data.csv", sep=",", varanno="anno_var.csv", obsanno="anno_obs.csv")
p.print_anndata(annobj)

anndata.X
[[10. 20. 30. 40.  5.  6.]
 [ 5. 10. 20. 30.  5.  6.]
 [ 6.  7.  8.  9.  1.  1.]]
anndata.var
Index(['T1_s1', 'T2_s2', 'T3_s1', 'T4_s2', 'T5_s1', 'T6_s1'], dtype='object')
      Gender   Tissue Individual Technology
T1_s1      F  Tissue1    Sample1     rnaseq
T2_s2      M  Tissue2    Sample2     rnaseq
T3_s1      F  Tissue3    Sample1     rnaseq
T4_s2      M  Tissue4    Sample2     rnaseq
T5_s1      M  Tissue5    Sample1     rnaseq
T6_s1      M  Tissue6    Sample1     rnaseq
anndata.obs
Index(['G1', 'G2', 'G3'], dtype='object')
       1   2     3     4  5
G1  hg19  T1  chr1  1111  -
G2  hg19  T2  chr2  2222  -
G3  hg19  T3  chr3  3333  -
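
The loader `p.load_count_matrix` belongs to the `rep` package and its internals are not shown in this notebook. As a rough, pandas-only sketch of how the three input tables look once parsed (an assumption about the layout, not the package's implementation), the snippet below reads the same toy data from inline strings and checks that both annotations line up with the count matrix. All variable names are illustrative.

```python
import io

import pandas as pd

# Same toy tables as in the cells above, inlined so the sketch is self-contained.
counts_csv = """T1_s1,T2_s2,T3_s1,T4_s2,T5_s1,T6_s1
G1,10,20,30,40,5,6
G2,5,10,20,30,5,6
G3,6,7,8,9,1,1
"""
var_csv = """Sample,Gender,Tissue,Individual,Technology
T1_s1,F,Tissue1,Sample1,rnaseq
T2_s2,M,Tissue2,Sample2,rnaseq
T3_s1,F,Tissue3,Sample1,rnaseq
T4_s2,M,Tissue4,Sample2,rnaseq
T5_s1,M,Tissue5,Sample1,rnaseq
T6_s1,M,Tissue6,Sample1,rnaseq
"""
obs_csv = """G1,hg19,T1,chr1,1111,-
G2,hg19,T2,chr2,2222,-
G3,hg19,T3,chr3,3333,-
"""

# The header of the count file has one field fewer than the data rows,
# so pandas uses the first column (gene IDs) as the row index.
counts = pd.read_csv(io.StringIO(counts_csv))

# Column annotation: one row per Tissue_Sample, indexed by the sample label.
var_anno = pd.read_csv(io.StringIO(var_csv), index_col=0)

# Row annotation: no header line, so the columns are numbered 1..5.
obs_anno = pd.read_csv(io.StringIO(obs_csv), header=None, index_col=0)

# Both annotations must be in the same order as the matrix they describe.
columns_aligned = list(counts.columns) == list(var_anno.index)
rows_aligned = list(counts.index) == list(obs_anno.index)
print(columns_aligned, rows_aligned)
```

Because `anno_obs.csv` has no header, its columns end up named `1` to `5`, which is why the printed `anndata.obs` above shows numeric column labels.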



#### Save data to h5ad format

In [16]:
name = p.save(annobj)
name

... storing '1' as categorical
... storing '5' as categorical
... storing '1' as categorical
... storing '3' as categorical
... storing '4' as categorical


'/data/nasif12/home_if12/giurgiu/rep_gagneur/rep/notebooks/preprocessing/tmp1545316680.h5ad'

#### 2. Prepare training sample:
- select genes and tissue_samples for training
- slice the anndata object
- compute the cross tissue matrix pair (X,Y)

In [65]:
# Filter by values in anndata.var and anndata.obs.
# The filter is a JSON-like dict of the form {column to filter on: list of admitted values}.
# In this example we keep the entries of anndata.var with Gender 'M', and anndata.obs should contain only G1 and G2.
(var, obs) = p.filter_anndata_by_value(annobj, {a.VAR: {'Gender': ['M']}, a.OBS: {0: ['G1', 'G2']}})

# these are the rows and columns selected for the training set
(var,obs)

      Gender   Tissue Individual Technology
T1_s1      F  Tissue1    Sample1     rnaseq
T2_s2      M  Tissue2    Sample2     rnaseq
T3_s1      F  Tissue3    Sample1     rnaseq
T4_s2      M  Tissue4    Sample2     rnaseq
T5_s1      M  Tissue5    Sample1     rnaseq
T6_s1      M  Tissue6    Sample1     rnaseq


(['T2_s2', 'T6_s1', 'T5_s1', 'T4_s2'], ['G1', 'G2'])
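
The value-based filter can be sketched in plain pandas as follows. This is a minimal illustration of the idea, assuming that every criterion must hold at once; `filter_index_by_value` is a hypothetical helper, not the `rep` function, and it returns labels in table order, whereas the output above is unordered.

```python
import pandas as pd

# Column annotation from the toy example (only the columns needed here).
var_anno = pd.DataFrame(
    {
        "Gender": ["F", "M", "F", "M", "M", "M"],
        "Individual": ["Sample1", "Sample2", "Sample1", "Sample2", "Sample1", "Sample1"],
    },
    index=["T1_s1", "T2_s2", "T3_s1", "T4_s2", "T5_s1", "T6_s1"],
)


def filter_index_by_value(df, criteria):
    """Hypothetical helper: return the index labels of the rows whose
    values are admitted for every column listed in `criteria`."""
    mask = pd.Series(True, index=df.index)
    for column, admitted in criteria.items():
        # A row survives only if its value is in the admitted list.
        mask &= df[column].isin(admitted)
    return df.loc[mask].index.tolist()


selected = filter_index_by_value(var_anno, {"Gender": ["M"]})
print(selected)
```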

In [67]:
# Since we want to impute gene expression across tissues, we compute
# the cross-tissue pairs within each sample, not across samples.

# For this purpose we can use the group_by(df, column, index_subset)
# function to group tissues by sample.

dict_samples_var = p.group_by(annobj.var, 'Individual', var)
dict_samples_var

{'Sample1': ['T6_s1', 'T5_s1'], 'Sample2': ['T2_s2', 'T4_s2']}
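
A grouping of this kind can be approximated with `DataFrame.groupby`. The sketch below is an assumption about the behaviour, not the source of `p.group_by`; `group_index_by` is a hypothetical name.

```python
import pandas as pd

# Column annotation from the toy example (only the grouping column).
var_anno = pd.DataFrame(
    {"Individual": ["Sample1", "Sample2", "Sample1", "Sample2", "Sample1", "Sample1"]},
    index=["T1_s1", "T2_s2", "T3_s1", "T4_s2", "T5_s1", "T6_s1"],
)

# Labels kept by the filtering step above.
subset = ["T2_s2", "T6_s1", "T5_s1", "T4_s2"]


def group_index_by(df, column, index_subset):
    """Hypothetical helper: map each value of `column` to the index
    labels (restricted to `index_subset`) that carry that value."""
    grouped = df.loc[index_subset].groupby(column)
    return {key: list(labels) for key, labels in grouped.groups.items()}


groups = group_index_by(var_anno, "Individual", subset)
print(groups)
```

Restricting to `index_subset` before grouping keeps the female samples `T1_s1` and `T3_s1` out of the `Sample1` group.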

In [72]:
# compute cross tissue matrix pair (X, Y)
(X, Y) = p.rnaseq_cross_tissue(annobj, individuals=['Sample1','Sample2'], gene_ids=obs)

4.3 Slice anndata

anndata.X
[[20.  5. 30.  6. 10. 40.]
 [10.  5. 20.  6.  5. 30.]]
anndata.var
Index(['T2_s2', 'T5_s1', 'T3_s1', 'T6_s1', 'T1_s1', 'T4_s2'], dtype='object')
      Gender   Tissue Individual Technology
T2_s2      M  Tissue2    Sample2     rnaseq
T5_s1      M  Tissue5    Sample1     rnaseq
T3_s1      F  Tissue3    Sample1     rnaseq
T6_s1      M  Tissue6    Sample1     rnaseq
T1_s1      F  Tissue1    Sample1     rnaseq
T4_s2      M  Tissue4    Sample2     rnaseq
anndata.obs
Index(['G1', 'G2'], dtype='object')
       1   2     3     4  5
G1  hg19  T1  chr1  1111  -
G2  hg19  T2  chr2  2222  -

4.4 Build the two matrices X and Y



In [49]:
# train matrix
X

             G1  G2
T6_s1_T5_s1   6   6
T5_s1_T6_s1   5   5
T2_s2_T4_s2  20  10
T4_s2_T2_s2  40  30


In [50]:
# label matrix
Y

             G1  G2
T6_s1_T5_s1   5   5
T5_s1_T6_s1   6   6
T2_s2_T4_s2  40  30
T4_s2_T2_s2  20  10
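
The pairing logic behind these two matrices can be sketched with `itertools.permutations`. This is an illustration under the assumption that each row named `A_B` holds the expression of tissue sample `A` in X and of `B` in Y, which is what the outputs above suggest; `cross_tissue_pairs` is a hypothetical function, not the implementation of `p.rnaseq_cross_tissue`.

```python
from itertools import permutations

import pandas as pd

# Counts for the selected genes (rows) across all tissue samples (columns).
counts = pd.DataFrame(
    [[10, 20, 30, 40, 5, 6], [5, 10, 20, 30, 5, 6]],
    index=["G1", "G2"],
    columns=["T1_s1", "T2_s2", "T3_s1", "T4_s2", "T5_s1", "T6_s1"],
)

# Tissue samples grouped by individual, as produced by the grouping step.
groups = {"Sample1": ["T6_s1", "T5_s1"], "Sample2": ["T2_s2", "T4_s2"]}


def cross_tissue_pairs(counts, groups):
    """Hypothetical helper: build (X, Y) from every ordered pair of
    tissue samples that belong to the same individual."""
    names, x_rows, y_rows = [], [], []
    for tissues in groups.values():
        # Ordered pairs, so (A, B) and (B, A) both appear.
        for source, target in permutations(tissues, 2):
            names.append(f"{source}_{target}")
            x_rows.append(counts[source].to_numpy())
            y_rows.append(counts[target].to_numpy())
    X = pd.DataFrame(x_rows, index=names, columns=counts.index)
    Y = pd.DataFrame(y_rows, index=names, columns=counts.index)
    return X, Y


X, Y = cross_tissue_pairs(counts, groups)
print(X)
print(Y)
```

An individual with n selected tissues contributes n * (n - 1) rows, so the two tissues per individual in this example give four rows in total.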
