### Goal
Preprocessing RNA-seq data

This tutorial walks through the main workflow for preparing a training dataset for a basic linear regression model.
The output (X, Y) contains all pairwise combinations of tissues within each individual (called a sample here), where X is the input and Y is the target.

In [2]:
# code autoreload
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
!pip install numpy
!pip install pandas
!pip install anndata

!pip install scanpy
!pip install argh



In [3]:
import rep.preprocessing as p
from rep.constants import ANNDATA_CST as a

%aimport

Modules to reload:
all-except-skipped

Modules to skip:



#### 1. Load the raw count matrix and annotations into an AnnData object

In [3]:
%%writefile data.csv
T1_s1,T2_s2,T3_s1,T4_s2,T5_s1,T6_s1,T1_s3
G1,10,20,30,40,5,6,23
G2,5,10,20,30,5,6,24
G3,6,7,8,9,1,1,25

Overwriting data.csv


In [4]:
%%writefile anno_var.csv
Sample,Gender,Tissue,Individual,Seq
T1_s1,F,Tissue1,Sample1,WGS
T2_s2,M,Tissue2,Sample2,WXS
T3_s1,F,Tissue3,Sample1,WGS
T4_s2,M,Tissue4,Sample2,WXS
T5_s1,M,Tissue5,Sample1,WGS
T6_s1,M,Tissue6,Sample1,WGS
T1_s3,F,Tissue1,Sample3,WGS

Overwriting anno_var.csv


In [5]:
%%writefile anno_obs.csv
Gene,assembly,transcript,chromosome,start,strand
G1,hg19,T1,chr1,1111,-
G2,hg19,T2,chr2,2222,-
G3,hg19,T3,chr3,3333,-

Overwriting anno_obs.csv


In [6]:
# annobj is the equivalent of a SummarizedExperiment object in R.
# It contains:
# - annobj.X - counts
# - annobj.var - annotation DataFrame for the columns of the count matrix (in our case Tissue_Sample)
# - annobj.obs - annotation DataFrame for the rows of the count matrix (the features we measure, in our case genes)

annobj = p.create_anndata("data.csv", sep=",", samples_anno="anno_var.csv", genes_anno="anno_obs.csv")
p.print_anndata(annobj)

anndata.X ----
[[10. 20. 30. 40.  5.  6. 23.]
 [ 5. 10. 20. 30.  5.  6. 24.]
 [ 6.  7.  8.  9.  1.  1. 25.]]
anndata.var ----
      Gender   Tissue Individual  Seq
T1_s1      F  Tissue1    Sample1  WGS
T2_s2      M  Tissue2    Sample2  WXS
T3_s1      F  Tissue3    Sample1  WGS
T4_s2      M  Tissue4    Sample2  WXS
T5_s1      M  Tissue5    Sample1  WGS
anndata.obs ----
   assembly transcript chromosome  start strand
G1     hg19         T1       chr1   1111      -
G2     hg19         T2       chr2   2222      -
G3     hg19         T3       chr3   3333      -
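Before building the object, it can help to check that the count matrix and the two annotation tables line up. The sketch below uses only pandas and inlines the toy files from the cells above; `load_toy_inputs` is a hypothetical helper for illustration, not part of `rep.preprocessing`. It relies on the documented pandas behaviour that a header with one fewer field than the data rows makes the first column the index.

```python
import io
import pandas as pd

# Toy inputs copied from the %%writefile cells above (inlined for self-containment).
COUNTS_CSV = """T1_s1,T2_s2,T3_s1,T4_s2,T5_s1,T6_s1,T1_s3
G1,10,20,30,40,5,6,23
G2,5,10,20,30,5,6,24
G3,6,7,8,9,1,1,25
"""
SAMPLES_CSV = """Sample,Gender,Tissue,Individual,Seq
T1_s1,F,Tissue1,Sample1,WGS
T2_s2,M,Tissue2,Sample2,WXS
T3_s1,F,Tissue3,Sample1,WGS
T4_s2,M,Tissue4,Sample2,WXS
T5_s1,M,Tissue5,Sample1,WGS
T6_s1,M,Tissue6,Sample1,WGS
T1_s3,F,Tissue1,Sample3,WGS
"""
GENES_CSV = """Gene,assembly,transcript,chromosome,start,strand
G1,hg19,T1,chr1,1111,-
G2,hg19,T2,chr2,2222,-
G3,hg19,T3,chr3,3333,-
"""

def load_toy_inputs():
    """Hypothetical helper: read counts (genes x samples) and both annotations."""
    # The header has 7 names but each row has 8 fields, so pandas uses
    # the first field (the gene id) as the row index.
    counts = pd.read_csv(io.StringIO(COUNTS_CSV))
    samples = pd.read_csv(io.StringIO(SAMPLES_CSV), index_col="Sample")
    genes = pd.read_csv(io.StringIO(GENES_CSV), index_col="Gene")
    return counts, samples, genes

counts, samples, genes = load_toy_inputs()

# Columns of the count matrix must match the sample annotation,
# rows must match the gene annotation.
columns_aligned = list(counts.columns) == list(samples.index)
rows_aligned = list(counts.index) == list(genes.index)
print(counts.shape, columns_aligned, rows_aligned)
```

If either check fails, reordering the annotation with `samples.loc[counts.columns]` (or the gene equivalent) is one way to restore alignment.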



#### Save data to h5ad format

In [11]:
name = p.save(annobj)
name

... storing 'assembly' as categorical
... storing 'strand' as categorical
... storing 'Gender' as categorical
... storing 'Individual' as categorical
... storing 'Technology' as categorical


'/data/nasif12/home_if12/giurgiu/rep_gagneur/rep/notebooks/preprocessing/tmp1547121923.h5ad'

#### 2. Prepare training sample:
- select genes and tissue_samples for training
- slice the anndata object
- compute the cross tissue matrix pair (X,Y)

In [12]:
# Filter by values in anndata.var and anndata.obs.
# The filter is a dict of the form {column to filter on: list of allowed values}.
# In this example we keep only the samples in anndata.var with Gender 'M' and only the genes G1 and G2 in anndata.obs.
(var, obs) = p.filter_anndata_by_value(annobj, {a.SAMPLES: {'Gender': ['M']}, a.GENES: {0: ['G1', 'G2']}})

# these are the rows and columns selected for the training set
(var,obs)

(['T4_s2', 'T6_s1', 'T5_s1', 'T2_s2'], ['G2', 'G1'])
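The same kind of value-based selection can be sketched with plain pandas. This is only an illustration of the idea under the assumption that each criterion is an "is in allowed values" test; `filter_index_by_value` is a hypothetical name and not the `rep` implementation. Unlike the output above, the result keeps the original index order.

```python
import pandas as pd

# Sample annotation from the toy example (only the columns needed here).
samples = pd.DataFrame(
    {
        "Gender": ["F", "M", "F", "M", "M", "M", "F"],
        "Individual": ["Sample1", "Sample2", "Sample1", "Sample2",
                       "Sample1", "Sample1", "Sample3"],
    },
    index=["T1_s1", "T2_s2", "T3_s1", "T4_s2", "T5_s1", "T6_s1", "T1_s3"],
)
gene_ids = pd.Index(["G1", "G2", "G3"])

def filter_index_by_value(df, criteria):
    """Return the index labels of rows matching every {column: allowed values} entry."""
    mask = pd.Series(True, index=df.index)
    for column, allowed in criteria.items():
        mask &= df[column].isin(allowed)
    return list(df.index[mask])

selected_samples = filter_index_by_value(samples, {"Gender": ["M"]})
# Genes are filtered on the index itself rather than on a column.
selected_genes = list(gene_ids[gene_ids.isin(["G1", "G2"])])
print(selected_samples, selected_genes)
```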

In [13]:
# since we want to impute gene expression across different tissues, we
# would like to compute the cross tissue pairs per sample and not across samples

# for this purpose we can use the group_by(df, column, index_subset) 
# function to group tissues by sample

dict_samples_var = p.group_by(annobj.var, 'Individual', var)
dict_samples_var

{'Sample1': ['T6_s1', 'T5_s1'], 'Sample2': ['T4_s2', 'T2_s2']}
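For readers without access to `rep`, the grouping step can be approximated with `DataFrame.groupby`. The function name `group_index_by` is hypothetical; the sketch assumes the goal is a dict that maps each value of the grouping column to the index labels that carry it.

```python
import pandas as pd

samples = pd.DataFrame(
    {"Individual": ["Sample1", "Sample2", "Sample1", "Sample2",
                    "Sample1", "Sample1", "Sample3"]},
    index=["T1_s1", "T2_s2", "T3_s1", "T4_s2", "T5_s1", "T6_s1", "T1_s3"],
)

def group_index_by(df, column, index_subset=None):
    """Map each value of `column` to the index labels that have that value."""
    subset = df if index_subset is None else df.loc[index_subset]
    # groupby keeps the row order of `subset` inside every group.
    return {key: list(group.index) for key, group in subset.groupby(column)}

grouped = group_index_by(samples, "Individual", ["T4_s2", "T6_s1", "T5_s1", "T2_s2"])
print(grouped)
```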

In [7]:
# split the individuals into train/valid/test sets, grouped by Gender and Seq
(train,valid,test) = p.split_by_individuals(annobj,groupby=['Gender','Seq'])

Total individuals: 4
Individual split before balancing:  3 0 1
Iteration:  1
	Expc counts:  [4, 1, 1]
	Real counts:  6 0 1
	Expc fraction:  [0.6, 0.2, 0.2]
	Real fraction:  0.8571428571428571 0.0 0.14285714285714285
	Diff:  [0.2571428571428571, -0.2, -0.05714285714285716]
remove best
Min error:  0.02857142857142858
Sample2
Iteration:  2
	Expc counts:  [4, 1, 1]
	Real counts:  4 2 1
	Expc fraction:  [0.6, 0.2, 0.2]
	Real fraction:  0.5714285714285714 0.2857142857142857 0.14285714285714285
	Diff:  [-0.02857142857142858, 0.08571428571428569, -0.05714285714285716]
remove best
Min error:  0.2
Sample2
Iteration:  3
	Expc counts:  [4, 1, 1]
	Real counts:  4 0 3
	Expc fraction:  [0.6, 0.2, 0.2]
	Real fraction:  0.5714285714285714 0.0 0.42857142857142855
	Diff:  [-0.02857142857142858, -0.2, 0.22857142857142854]
remove best
Min error:  0.05714285714285716
Sample2
Iteration:  4
	Expc counts:  [4, 1, 1]
	Real counts:  6 0 1
	Expc fraction:  [0.6, 0.2, 0.2]
	Real fraction:  0.8571428571428571 0.0 0
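The log above compares expected fractions (0.6, 0.2, 0.2) with the real fraction of samples that ends up in each split. The sketch below shows only that bookkeeping: individuals are shuffled and cut by the target fractions, then the per-split sample fractions and their difference from the target are computed. It does not reproduce the iterative balancing that `split_by_individuals` prints; the function names and the seed are assumptions made for illustration.

```python
import random

def split_individuals(samples_per_individual, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle individuals and cut them into train/valid/test by `fractions`."""
    individuals = sorted(samples_per_individual)
    random.Random(seed).shuffle(individuals)
    n_train = round(fractions[0] * len(individuals))
    n_valid = round(fractions[1] * len(individuals))
    train = individuals[:n_train]
    valid = individuals[n_train:n_train + n_valid]
    test = individuals[n_train + n_valid:]
    return train, valid, test

def sample_fractions(splits, samples_per_individual):
    """Fraction of all samples that falls into each split."""
    counts = [sum(samples_per_individual[i] for i in part) for part in splits]
    total = sum(counts)
    return [count / total for count in counts]

# Number of tissue samples per individual in the toy annotation.
toy_sizes = {"Sample1": 4, "Sample2": 2, "Sample3": 1}
expected = (0.6, 0.2, 0.2)

splits = split_individuals(toy_sizes, expected)
real = sample_fractions(splits, toy_sizes)
# Same sign convention as the "Diff" lines above: real minus expected.
diff = [r - e for r, e in zip(real, expected)]
print(splits, real, diff)
```

Splitting by individual rather than by sample keeps all tissues of one person in the same set, so the sample fractions can drift from the target when individuals contribute different numbers of tissues.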

In [8]:
(X_train, Y_train) = p.rnaseq_cross_tissue(annobj, individuals=train, gene_ids=annobj.obs_names)

samples_df  (7, 4)
samples_df_sliced  (6, 4)
compute all arrangements
Total pairs: 14
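The 14 pairs reported above are consistent with taking every ordered pair of tissues within an individual: 4 x 3 for Sample1 plus 2 x 1 for Sample2. The following is a minimal sketch of that construction with `itertools.permutations`, assuming the training set holds Sample1 and Sample2; `cross_tissue_pairs` is a hypothetical stand-in, not the code behind `p.rnaseq_cross_tissue`.

```python
from itertools import permutations

import pandas as pd

# Toy count matrix (genes x samples) from data.csv.
counts = pd.DataFrame(
    {
        "T1_s1": [10, 5, 6], "T2_s2": [20, 10, 7], "T3_s1": [30, 20, 8],
        "T4_s2": [40, 30, 9], "T5_s1": [5, 5, 1], "T6_s1": [6, 6, 1],
        "T1_s3": [23, 24, 25],
    },
    index=["G1", "G2", "G3"],
    dtype=float,
)
individual_of = {
    "T1_s1": "Sample1", "T2_s2": "Sample2", "T3_s1": "Sample1",
    "T4_s2": "Sample2", "T5_s1": "Sample1", "T6_s1": "Sample1",
    "T1_s3": "Sample3",
}

def cross_tissue_pairs(counts, individual_of, individuals):
    """Build (X, Y) with one row per ordered tissue pair within each individual."""
    by_individual = {}
    for sample in counts.columns:
        by_individual.setdefault(individual_of[sample], []).append(sample)

    names, x_rows, y_rows = [], [], []
    for individual in individuals:
        for source, target in permutations(by_individual.get(individual, []), 2):
            names.append(f"{source}_{target}")
            x_rows.append(counts[source].to_numpy())  # input: source tissue
            y_rows.append(counts[target].to_numpy())  # target: other tissue
    X = pd.DataFrame(x_rows, index=names, columns=counts.index)
    Y = pd.DataFrame(y_rows, index=names, columns=counts.index)
    return X, Y

X, Y = cross_tissue_pairs(counts, individual_of, ["Sample1", "Sample2"])
print(len(X))
```

Because the pairs are ordered, each tissue appears both as input and as target, which is why the same X row repeats for every target tissue of the same source.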


In [9]:
# train matrix
X_train[:5]

Unnamed: 0,G1,G2,G3
T1_s1_T6_s1,10.0,5.0,6.0
T1_s1_T3_s1,10.0,5.0,6.0
T1_s1_T5_s1,10.0,5.0,6.0
T6_s1_T1_s1,6.0,6.0,1.0
T6_s1_T3_s1,6.0,6.0,1.0


In [10]:
# label matrix
Y_train[:5]

Unnamed: 0,G1,G2,G3
T1_s1_T6_s1,6.0,6.0,1.0
T1_s1_T3_s1,30.0,20.0,8.0
T1_s1_T5_s1,5.0,5.0,1.0
T6_s1_T1_s1,10.0,5.0,6.0
T6_s1_T3_s1,30.0,20.0,8.0


In [28]:
# train matrix
X_train[:5]

Unnamed: 0,G1,G2,G3
T3_s1_T6_s1,30.0,20.0,8.0
T3_s1_T1_s1,30.0,20.0,8.0
T3_s1_T5_s1,30.0,20.0,8.0
T6_s1_T3_s1,6.0,6.0,1.0
T6_s1_T1_s1,6.0,6.0,1.0


In [16]:
# label matrix
Y_train[:5]

Unnamed: 0,G1,G2,G3
T3_s1_T6_s1,0.845098,0.845098,0.30103
T3_s1_T1_s1,1.041393,0.778151,0.845098
T3_s1_T5_s1,0.778151,0.778151,0.30103
T6_s1_T3_s1,1.491362,1.322219,0.954243
T6_s1_T1_s1,1.041393,0.778151,0.845098
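The transformed labels above match log10(count + 1) applied to the raw counts, for example log10(6 + 1) = 0.845098 and log10(30 + 1) = 1.491362. This is inferred from the printed values only; the exact transform used inside `rep` is an assumption here. A short NumPy check:

```python
import numpy as np

# Raw target rows taken from the untransformed Y_train output:
# T6_s1, T1_s1, T5_s1 and T3_s1 for genes G1, G2, G3.
raw = np.array([
    [6.0, 6.0, 1.0],
    [10.0, 5.0, 6.0],
    [5.0, 5.0, 1.0],
    [30.0, 20.0, 8.0],
])

# Adding 1 before taking the logarithm keeps zero counts finite.
transformed = np.log10(raw + 1.0)
print(np.round(transformed, 6))
```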


#### 3. Create a large toy dataset

In [4]:
import random
import math
import numpy.random as nr
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

import anndata

In [5]:
# genes
genes = ["Gene_" + str(x) for x in range(1, 20001)]
genes[19990:]

['Gene_19991',
 'Gene_19992',
 'Gene_19993',
 'Gene_19994',
 'Gene_19995',
 'Gene_19996',
 'Gene_19997',
 'Gene_19998',
 'Gene_19999',
 'Gene_20000']

In [6]:
raw_counts = nr.negative_binomial(3,0.001,size=20000)
# sns.distplot(raw_counts)
raw_counts[:10]

array([1837, 7955,  505, 4570, 1270, 4205, 5287, 1949, 1877, 5478])
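To get a feeling for the scale of these simulated counts: NumPy's negative binomial with `n` successes and success probability `p` has mean n(1 - p)/p and variance n(1 - p)/p^2, so n = 3 and p = 0.001 give a mean close to 2997. The sketch below compares this with a seeded draw; the seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

n_successes, p_success = 3, 0.001

# Theoretical moments of the negative binomial as parameterised by NumPy.
expected_mean = n_successes * (1 - p_success) / p_success
expected_var = n_successes * (1 - p_success) / p_success ** 2

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
draws = rng.negative_binomial(n_successes, p_success, size=20000)

print(expected_mean, draws.mean())
print(expected_var ** 0.5, draws.std())
```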

In [7]:
# tissues
tissues = ['Adipose - Subcutaneous',
 'Adipose - Visceral (Omentum)',
 'Adrenal Gland',
 'Artery - Aorta',
 'Artery - Coronary',
 'Artery - Tibial',
 'Bladder',
 'Brain - Amygdala',
 'Brain - Anterior cingulate cortex (BA24)',
 'Brain - Caudate (basal ganglia)',
 'Brain - Cerebellar Hemisphere',
 'Brain - Cerebellum',
 'Brain - Cortex',
 'Brain - Frontal Cortex (BA9)',
 'Brain - Hippocampus',
 'Brain - Hypothalamus',
 'Brain - Nucleus accumbens (basal ganglia)',
 'Brain - Putamen (basal ganglia)',
 'Brain - Spinal cord (cervical c-1)',
 'Brain - Substantia nigra',
 'Breast - Mammary Tissue',
 'Cells - EBV-transformed lymphocytes',
 'Cells - Transformed fibroblasts',
 'Cervix - Ectocervix',
 'Cervix - Endocervix',
 'Colon - Sigmoid',
 'Colon - Transverse',
 'Esophagus - Gastroesophageal Junction',
 'Esophagus - Mucosa',
 'Esophagus - Muscularis',
 'Fallopian Tube',
 'Heart - Atrial Appendage',
 'Heart - Left Ventricle',
 'Kidney - Cortex',
 'Liver',
 'Lung',
 'Minor Salivary Gland',
 'Muscle - Skeletal',
 'Nerve - Tibial',
 'Ovary',
 'Pancreas',
 'Pituitary',
 'Prostate',
 'Skin - Not Sun Exposed (Suprapubic)',
 'Skin - Sun Exposed (Lower leg)',
 'Small Intestine - Terminal Ileum',
 'Spleen',
 'Stomach',
 'Testis',
 'Thyroid',
 'Uterus',
 'Vagina',
 'Whole Blood']

In [8]:
# simulate how many tissues are collected per patient
distribution_tissues = nr.uniform(1,len(tissues),1000)
distribution_tissues.shape,distribution_tissues[:10]

((1000,),
 array([23.45844837, 32.42258621, 43.67978791, 46.51771167, 32.30568135,
        31.01255899,  5.25819487, 25.8020789 , 28.54055117, 48.85421578]))

In [9]:
# generate samples - raw counts per gene per tissue
frames = []
for i in range(1,150): # patients
  
    # choose tissues
    # random.randint is inclusive on both ends, so the upper bound must be len - 1
    how_many_tissues = math.floor(distribution_tissues[random.randint(0, len(distribution_tissues) - 1)])
    random.shuffle(tissues)
    t_set = tissues[:how_many_tissues]  
  
    # initialize count matrix
    counts = pd.DataFrame(columns=genes,index=map(lambda x: "Patient_" + str(i) + "_" + str(x), t_set))

    for j in range(0,len(t_set)):
    
        col_name = "Patient_" + str(i) + "_" + t_set[j]
    
        # generate raw counts per tissue per set of genes
        raw_counts = nr.negative_binomial(3,0.001,size=20000)
        counts.loc[col_name] = raw_counts.reshape(1,20000)
    
    # append expression per tissue for the patient 
    frames.append(counts)
      

df_big = pd.concat(frames)
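Filling an empty DataFrame one row at a time works, but it gets slow for 20000 genes. A possible alternative, sketched here under the assumption that only the final shape and naming matter, draws all counts for a patient in one call and builds the frame from the array. `simulate_patient` and the small gene and tissue lists are hypothetical and chosen to keep the example fast.

```python
import numpy as np
import pandas as pd

def simulate_patient(patient_id, tissue_names, gene_names, rng):
    """Return a (tissues x genes) frame of simulated counts for one patient."""
    # Mirrors floor(uniform(1, len(tissues))): between 1 and len - 1 tissues.
    n_tissues = int(rng.integers(1, len(tissue_names)))
    chosen = rng.choice(tissue_names, size=n_tissues, replace=False)
    # One vectorised draw instead of one draw per tissue.
    values = rng.negative_binomial(3, 0.001, size=(n_tissues, len(gene_names)))
    index = [f"Patient_{patient_id}_{tissue}" for tissue in chosen]
    return pd.DataFrame(values, index=index, columns=gene_names)

rng = np.random.default_rng(0)
toy_tissues = ["Liver", "Lung", "Ovary", "Whole Blood"]
toy_genes = [f"Gene_{i}" for i in range(1, 11)]

toy_frames = [simulate_patient(i, toy_tissues, toy_genes, rng) for i in range(1, 6)]
toy_big = pd.concat(toy_frames)
print(toy_big.shape)
```

Sampling without replacement plays the role of the shuffle-and-slice step in the loop above and leaves the original tissue list untouched.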

In [10]:
# transpose the matrix to be genes x patient_tissue
count_matrix = df_big.transpose()
count_matrix.shape

(20000, 4063)

In [11]:
# create annotation dataframes
# note: column 0 is the index, so be careful
samples_anno = pd.DataFrame(index=count_matrix.columns,columns=['Individual','Tissue'])
gender=[1,2]
seq=['WGS','WXS']

for i in range(0,samples_anno.shape[0]):
    sample = samples_anno.index[i]
    [pat,number,tissue] = sample.split("_")
  
    samples_anno.at[sample, 'Individual'] = pat + "_" + str(number)
    samples_anno.at[sample, 'Tissue'] = tissue
    samples_anno.at[sample, 'Gender'] = gender[int(number)%2]
    samples_anno.at[sample, 'Seq'] = seq[int(number)%2]

samples_anno.drop_duplicates().shape
samples_anno[:10]


Unnamed: 0,Individual,Tissue,Gender,Seq
Patient_1_Artery - Coronary,Patient_1,Artery - Coronary,2.0,WXS
Patient_1_Artery - Tibial,Patient_1,Artery - Tibial,2.0,WXS
Patient_1_Brain - Cerebellar Hemisphere,Patient_1,Brain - Cerebellar Hemisphere,2.0,WXS
Patient_1_Skin - Sun Exposed (Lower leg),Patient_1,Skin - Sun Exposed (Lower leg),2.0,WXS
Patient_1_Adipose - Subcutaneous,Patient_1,Adipose - Subcutaneous,2.0,WXS
Patient_1_Uterus,Patient_1,Uterus,2.0,WXS
Patient_1_Muscle - Skeletal,Patient_1,Muscle - Skeletal,2.0,WXS
Patient_1_Brain - Substantia nigra,Patient_1,Brain - Substantia nigra,2.0,WXS
Patient_1_Ovary,Patient_1,Ovary,2.0,WXS
Patient_1_Testis,Patient_1,Testis,2.0,WXS
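The per-row loop above can also be written with vectorised string operations. The sketch below is an assumption-labelled alternative, not the notebook's code: it splits each name at the first two underscores only, so a tissue name containing an underscore would stay intact, and it keeps `Gender` as an integer column. The variable names are illustrative.

```python
import pandas as pd

sample_names = pd.Index([
    "Patient_1_Artery - Coronary",
    "Patient_1_Brain - Cerebellar Hemisphere",
    "Patient_2_Liver",
])

# Split into at most three parts: "Patient", the patient number, the tissue.
parts = sample_names.to_series().str.split("_", n=2, expand=True)
parts.columns = ["prefix", "number", "Tissue"]

# Same rule as gender[int(number) % 2] and seq[int(number) % 2] in the loop.
parity = parts["number"].astype(int) % 2

anno = pd.DataFrame({
    "Individual": parts["prefix"] + "_" + parts["number"],
    "Tissue": parts["Tissue"],
    "Gender": parity.map({0: 1, 1: 2}),
    "Seq": parity.map({0: "WGS", 1: "WXS"}),
})
print(anno)
```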


In [12]:
genes_anno = pd.DataFrame(index=count_matrix.index)
genes_anno.shape

(20000, 0)

In [13]:
# create annobj
annobj = anndata.AnnData(count_matrix)
annobj.var = samples_anno
annobj.var_names = list(samples_anno.index)
annobj.obs = genes_anno
annobj.obs_names = list(genes_anno.index)

p.print_anndata(annobj)

anndata.X ----
[[2254. 2468. 5217. ... 1010.  651. 2448.]
 [4728. 2741. 9423. ... 3820. 3645. 3259.]
 [1844. 6626. 1925. ... 2535. 3043. 3148.]
 ...
 [1005. 2865. 2508. ...  370. 1902. 1089.]
 [7190. 2160.  928. ... 1992.  761. 1185.]
 [ 820. 1601. 3391. ... 3824. 2825. 2001.]]
anndata.var ----
                                         Individual  \
Patient_1_Artery - Coronary               Patient_1   
Patient_1_Artery - Tibial                 Patient_1   
Patient_1_Brain - Cerebellar Hemisphere   Patient_1   
Patient_1_Skin - Sun Exposed (Lower leg)  Patient_1   
Patient_1_Adipose - Subcutaneous          Patient_1   

                                                                  Tissue  \
Patient_1_Artery - Coronary                            Artery - Coronary   
Patient_1_Artery - Tibial                                Artery - Tibial   
Patient_1_Brain - Cerebellar Hemisphere    Brain - Cerebellar Hemisphere   
Patient_1_Skin - Sun Exposed (Lower leg)  Skin - Sun Exposed (Lower l

In [14]:
%time (train,valid,test) = p.split_by_individuals(annobj,groupby=['Gender','Seq'])
"Individual split: ", len(train),len(valid),len(test)

Total individuals: 149
Individual split before balancing:  89 29 31
Iteration:  1
	Expc counts:  [2437, 812, 812]
	Real counts:  2361 805 897
	Expc fraction:  [0.6, 0.2, 0.2]
	Real fraction:  0.5810977110509475 0.1981294609894167 0.22077282795963574
	Diff:  [-0.018902288949052437, -0.0018705390105833164, 0.020772827959635726]
remove best
Min error:  0.007974403150381476
Patient_132
Iteration:  2
	Expc counts:  [2437, 812, 812]
	Real counts:  2413 805 845
	Expc fraction:  [0.6, 0.2, 0.2]
	Real fraction:  0.5938961358602018 0.1981294609894167 0.2079744031503815
	Diff:  [-0.006103864139798132, -0.0018705390105833164, 0.007974403150381476]
remove best
Min error:  0.0045778981048486544
Patient_118
Iteration:  3
	Expc counts:  [2437, 812, 812]
	Real counts:  2464 805 794
	Expc fraction:  [0.6, 0.2, 0.2]
	Real fraction:  0.606448437115432 0.1981294609894167 0.19542210189515136
	Diff:  [0.006448437115431971, -0.0018705390105833164, -0.0045778981048486544]
remove best
Min error:  0.000689145951

('Individual split: ', 90, 29, 30)

In [30]:
# slice the data and compute all pairs; note that only the training call sets input_transform='log10'
%time (X_train, Y_train) = p.rnaseq_cross_tissue(annobj, individuals=train,gene_ids=annobj.obs_names,input_transform='log10')
%time (X_valid, Y_valid) = p.rnaseq_cross_tissue(annobj, individuals=valid,gene_ids=annobj.obs_names)
%time (X_test, Y_test) = p.rnaseq_cross_tissue(annobj, individuals=test,gene_ids=annobj.obs_names)

samples_df  (4063, 4)
samples_df_sliced  (2435, 4)
compute all arrangements
Total pairs: 84696
CPU times: user 21.7 s, sys: 11.8 s, total: 33.4 s
Wall time: 15.8 s
samples_df  (4063, 4)
samples_df_sliced  (808, 4)
compute all arrangements
Total pairs: 27030
CPU times: user 14.4 s, sys: 4.98 s, total: 19.4 s
Wall time: 5.76 s
samples_df  (4063, 4)
samples_df_sliced  (820, 4)
compute all arrangements
Total pairs: 26526
CPU times: user 14.3 s, sys: 5.48 s, total: 19.8 s
Wall time: 6.05 s
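Since every ordered tissue pair becomes one row, an individual with k tissues contributes k(k - 1) rows, and the pair matrices grow quickly. The helper below counts pairs from tissue counts and gives a rough size for a dense matrix. The 8 bytes per value is an assumption (float64); the dtype actually returned by `rep` is not shown in this notebook, and both function names are hypothetical.

```python
def count_pairs(tissues_per_individual):
    """Total number of ordered tissue pairs: sum of k * (k - 1) per individual."""
    return sum(k * (k - 1) for k in tissues_per_individual)

def dense_matrix_bytes(n_rows, n_columns, bytes_per_value=8):
    """Approximate memory of a dense numeric matrix (float64 by default)."""
    return n_rows * n_columns * bytes_per_value

# Toy training set: Sample1 has 4 tissues, Sample2 has 2.
toy_pairs = count_pairs([4, 2])

# Training matrix from the log above: 84696 pairs x 20000 genes.
train_bytes = dense_matrix_bytes(84696, 20000)
print(toy_pairs, train_bytes / 1e9, "GB")
```

Under that assumption, X_train and Y_train would each take roughly 13.6 GB, which is worth keeping in mind before scaling the simulation up.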


In [31]:
# train data
X_train[:10]

Unnamed: 0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,Gene_10,...,Gene_19991,Gene_19992,Gene_19993,Gene_19994,Gene_19995,Gene_19996,Gene_19997,Gene_19998,Gene_19999,Gene_20000
Patient_1_Liver_Patient_1_Esophagus - Muscularis,3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143
Patient_1_Liver_Patient_1_Prostate,3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143
Patient_1_Liver_Patient_1_Brain - Amygdala,3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143
Patient_1_Liver_Patient_1_Brain - Hypothalamus,3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143
Patient_1_Liver_Patient_1_Artery - Coronary,3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143
Patient_1_Liver_Patient_1_Testis,3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143
Patient_1_Liver_Patient_1_Adrenal Gland,3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143
Patient_1_Liver_Patient_1_Uterus,3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143
Patient_1_Liver_Patient_1_Skin - Sun Exposed (Lower leg),3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143
Patient_1_Liver_Patient_1_Thyroid,3.916033,3.713659,3.405517,3.543074,3.619406,3.448242,3.617734,3.066326,3.318689,3.703463,...,3.129045,3.50637,3.386677,3.308351,3.582972,3.700877,3.12969,3.635182,3.661055,3.289143


In [32]:
X_valid[:5]

Unnamed: 0,Gene_1,Gene_2,Gene_3,Gene_4,Gene_5,Gene_6,Gene_7,Gene_8,Gene_9,Gene_10,...,Gene_19991,Gene_19992,Gene_19993,Gene_19994,Gene_19995,Gene_19996,Gene_19997,Gene_19998,Gene_19999,Gene_20000
Patient_101_Artery - Tibial_Patient_101_Liver,3903.0,1838.0,2472.0,2963.0,3733.0,2716.0,1810.0,1186.0,5651.0,1823.0,...,1102.0,3748.0,2659.0,5499.0,1430.0,3045.0,6874.0,1187.0,1010.0,2281.0
Patient_101_Artery - Tibial_Patient_101_Brain - Hippocampus,3903.0,1838.0,2472.0,2963.0,3733.0,2716.0,1810.0,1186.0,5651.0,1823.0,...,1102.0,3748.0,2659.0,5499.0,1430.0,3045.0,6874.0,1187.0,1010.0,2281.0
Patient_101_Artery - Tibial_Patient_101_Cells - EBV-transformed lymphocytes,3903.0,1838.0,2472.0,2963.0,3733.0,2716.0,1810.0,1186.0,5651.0,1823.0,...,1102.0,3748.0,2659.0,5499.0,1430.0,3045.0,6874.0,1187.0,1010.0,2281.0
Patient_101_Artery - Tibial_Patient_101_Brain - Nucleus accumbens (basal ganglia),3903.0,1838.0,2472.0,2963.0,3733.0,2716.0,1810.0,1186.0,5651.0,1823.0,...,1102.0,3748.0,2659.0,5499.0,1430.0,3045.0,6874.0,1187.0,1010.0,2281.0
Patient_101_Artery - Tibial_Patient_101_Heart - Atrial Appendage,3903.0,1838.0,2472.0,2963.0,3733.0,2716.0,1810.0,1186.0,5651.0,1823.0,...,1102.0,3748.0,2659.0,5499.0,1430.0,3045.0,6874.0,1187.0,1010.0,2281.0


In [29]:
annobj[:,'Patient_101_Artery - Tibial'].X[:7]

ArrayView([3903., 1838., 2472., 2963., 3733., 2716., 1810.], dtype=float32)