### Goal
Compare gene expression between recount2 and Expecto.
 - recount2 -> https://jhubiostatistics.shinyapps.io/recount/
 - expecto -> https://www.nature.com/articles/s41588-018-0160-6

Expecto GTEx entries are computed as log(mean(gene expression + 0.0001)) for each tissue:
- `geneanno.exp.csv` contains 218 tissues; the first 53 are GTEx entries
- GTEx v6 (this is done using hg19)
- GENCODE v24 (this relates to GRCh38.p5)
- lift anno to hg19/GRCh37
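
As a rough illustration of the aggregation described above, the sketch below computes a per-tissue log-mean profile from a genes x samples table. The function name `log_mean_per_tissue`, the sample-to-tissue mapping, and the toy values are assumptions made for illustration. Adding the pseudocount after averaging is equivalent to adding it to every value before averaging; whether the ExPecto resources apply it exactly this way is not verified here.

```python
import numpy as np
import pandas as pd

def log_mean_per_tissue(expr, sample_tissue, pseudocount=1e-4):
    """Average expression over the samples of each tissue, then log-transform.

    expr: genes x samples DataFrame
    sample_tissue: Series mapping sample name -> tissue name
    """
    # transpose to samples x genes, group the samples by tissue, average, transpose back
    tissue_mean = expr.T.groupby(sample_tissue).mean().T
    return np.log(tissue_mean + pseudocount)

# toy data (hypothetical): two genes, three samples, two tissues
expr = pd.DataFrame(
    {"s1": [10.0, 0.0], "s2": [30.0, 0.0], "s3": [5.0, 2.0]},
    index=["geneA", "geneB"],
)
sample_tissue = pd.Series({"s1": "Liver", "s2": "Liver", "s3": "Lung"})
log_mean = log_mean_per_tissue(expr, sample_tissue)
print(log_mean)
```

The pseudocount keeps genes with zero mean expression finite after the log transform.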

Recount2 GTEx data was downloaded using the Bioconductor package recount; see `/s/project/gtex-processed/recount/loadGtexData.R`.
- RSE gene: the RangedSummarizedExperiment object with counts summarized at the gene level using the Gencode v25 (GRCh38.p7, CHR) annotation

Additionally, compare Expecto with GTEx V6 (direct download); this should give a higher correlation.

In [51]:
%load_ext autoreload
%autoreload 2

import os

import pandas as pd
import scipy.stats as s


#### 1. Expecto data

In [31]:
# data
expecto_counts_file = "/s/project/avsec/ExPecto/resources/geneanno.exp.csv"
expecto_anno_file = "/s/project/avsec/ExPecto/resources/geneanno.csv"

In [32]:
# read_csv already returns a DataFrame; the gene ID column becomes the index
expecto_anno = pd.read_csv(os.path.abspath(expecto_anno_file), header=0, delimiter=",", index_col=0)
expecto_anno[:5]

id,symbol,seqnames,strand,TSS,CAGE_representative_TSS,type
ENSG00000000003,TSPAN6,chrX,-,99894988,99891748,protein_coding
ENSG00000000005,TNMD,chrX,+,99839799,99839933,protein_coding
ENSG00000000419,DPM1,chr20,-,49575092,49575069,protein_coding
ENSG00000000457,SCYL3,chr1,-,169863408,169863037,protein_coding
ENSG00000000460,C1orf112,chr1,+,169631245,169764186,protein_coding


In [33]:
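expecto_counts = pd.read_csv(os.path.abspath(expecto_counts_file), header=0, delimiter=",", index_col=0)
# reuse the gene IDs from the annotation; assumes both files list genes in the same row order
expecto_counts.index = expecto_anno.index
# keep only the first 53 columns, i.e. the GTEx tissues
expecto_counts = expecto_counts.iloc[:, :53]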
expecto_counts.shape

(24339, 53)

#### 2. Recount2 data

In [34]:
# data
recount2_counts_file = "/s/project/gtex-processed/recount/version2/recount_mean_tissues.csv"

In [35]:
recount2 = pd.read_csv(os.path.abspath(recount2_counts_file), header=0, delimiter="\t", index_col=0)

# ignore the Ensembl ID version (everything after the first dot)
recount2.index = [item.split(".")[0] for item in recount2.index]
recount2.index.name = 'ensemblid'

# remove the 'Cells - Leukemia cell line (CML)' column
recount2 = recount2.drop(columns=['Cells - Leukemia cell line (CML)'])
recount2.shape

(58037, 53)
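
Stripping the version suffix can map several versioned IDs onto the same base ID (for example, IDs carrying a `_PAR_Y` suffix in some GENCODE releases), which would duplicate rows in a later join. The following sketch uses a hypothetical helper `strip_ensembl_version` and toy IDs to show one way to strip versions and keep a single row per base ID. Whether such duplicates occur in the recount2 table is not verified here, and keeping the first row is only one possible policy.

```python
import pandas as pd

def strip_ensembl_version(df):
    """Return a copy indexed by unversioned Ensembl gene IDs, one row per ID."""
    out = df.copy()
    # 'ENSG00000182378.13_PAR_Y' -> 'ENSG00000182378'
    out.index = out.index.str.split(".").str[0]
    out.index.name = "ensemblid"
    # keep the first row for each base ID so the index is unique for joins
    return out[~out.index.duplicated(keep="first")]

# toy table (hypothetical values)
toy = pd.DataFrame(
    {"Liver": [1.0, 2.0, 3.0]},
    index=["ENSG00000000003.14", "ENSG00000182378.13", "ENSG00000182378.13_PAR_Y"],
)
clean = strip_ensembl_version(toy)
print(clean)
```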

#### 3. Compare
Keep only genes present in both tables (matched by Ensembl ID, ignoring the version), then compute a Spearman correlation per tissue.

In [36]:
recount2.columns.values.tolist()

['Adipose - Subcutaneous',
 'Adipose - Visceral (Omentum)',
 'Adrenal Gland',
 'Artery - Aorta',
 'Artery - Coronary',
 'Artery - Tibial',
 'Bladder',
 'Brain - Amygdala',
 'Brain - Anterior cingulate cortex (BA24)',
 'Brain - Caudate (basal ganglia)',
 'Brain - Cerebellar Hemisphere',
 'Brain - Cerebellum',
 'Brain - Cortex',
 'Brain - Frontal Cortex (BA9)',
 'Brain - Hippocampus',
 'Brain - Hypothalamus',
 'Brain - Nucleus accumbens (basal ganglia)',
 'Brain - Putamen (basal ganglia)',
 'Brain - Spinal cord (cervical c-1)',
 'Brain - Substantia nigra',
 'Breast - Mammary Tissue',
 'Cells - EBV-transformed lymphocytes',
 'Cells - Transformed fibroblasts',
 'Cervix - Ectocervix',
 'Cervix - Endocervix',
 'Colon - Sigmoid',
 'Colon - Transverse',
 'Esophagus - Gastroesophageal Junction',
 'Esophagus - Mucosa',
 'Esophagus - Muscularis',
 'Fallopian Tube',
 'Heart - Atrial Appendage',
 'Heart - Left Ventricle',
 'Kidney - Cortex',
 'Liver',
 'Lung',
 'Minor Salivary Gland',
 'Muscle - Sk

In [37]:
expecto_counts.columns.values.tolist()

['Adipose_Subcutaneous',
 'Adipose_Visceral_Omentum',
 'Adrenal_Gland',
 'Artery_Aorta',
 'Artery_Coronary',
 'Artery_Tibial',
 'Bladder',
 'Brain_Amygdala',
 'Brain_Anterior_cingulate_cortex_BA24',
 'Brain_Caudate_basal_ganglia',
 'Brain_Cerebellar_Hemisphere',
 'Brain_Cerebellum',
 'Brain_Cortex',
 'Brain_Frontal_Cortex_BA9',
 'Brain_Hippocampus',
 'Brain_Hypothalamus',
 'Brain_Nucleus_accumbens_basal_ganglia',
 'Brain_Putamen_basal_ganglia',
 'Brain_Spinal_cord_cervical_c1',
 'Brain_Substantia_nigra',
 'Breast_Mammary_Tissue',
 'Cells_EBV-transformed_lymphocytes',
 'Cells_Transformed_fibroblasts',
 'Cervix_Ectocervix',
 'Cervix_Endocervix',
 'Colon_Sigmoid',
 'Colon_Transverse',
 'Esophagus_Gastroesophageal_Junction',
 'Esophagus_Mucosa',
 'Esophagus_Muscularis',
 'Fallopian_Tube',
 'Heart_Atrial_Appendage',
 'Heart_Left_Ventricle',
 'Kidney_Cortex',
 'Liver',
 'Lung',
 'Minor_Salivary_Gland',
 'Muscle_Skeletal',
 'Nerve_Tibial',
 'Ovary',
 'Pancreas',
 'Pituitary',
 'Prostate',
 'S
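
The positional column split and the index-based loop used later in this section rely on both tables listing the tissues in the same order. One way to check this is to normalize the recount2 labels to the Expecto naming and compare the two lists. The rules in the hypothetical helper `normalize_tissue` are inferred from the labels visible above and may not cover the truncated part of the lists.

```python
import re

def normalize_tissue(name):
    """Convert a recount2-style tissue label to the Expecto-style column name."""
    name = name.replace(" - ", "_")      # 'Artery - Aorta' -> 'Artery_Aorta'
    name = re.sub(r"[()]", "", name)     # drop parentheses
    name = name.replace(" ", "_")        # remaining spaces -> underscores
    return name.replace("c-1", "c1")     # special case seen in the spinal cord label

# a few labels copied from the two listings above
recount2_cols = [
    "Adipose - Visceral (Omentum)",
    "Brain - Spinal cord (cervical c-1)",
    "Cells - EBV-transformed lymphocytes",
    "Liver",
]
expecto_cols = [
    "Adipose_Visceral_Omentum",
    "Brain_Spinal_cord_cervical_c1",
    "Cells_EBV-transformed_lymphocytes",
    "Liver",
]
normalized = [normalize_tissue(c) for c in recount2_cols]
same_order = normalized == expecto_cols
print(same_order)
```

With the real tables, the same comparison could be run on `recount2.columns` and `expecto_counts.columns` before joining.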

In [38]:
joined_df = recount2.join(expecto_counts, how='inner', lsuffix='_left', rsuffix='_right')
joined_df.shape # total of 24048 genes could be mapped

(24048, 106)

In [39]:
recount2 = joined_df.iloc[:,:53]
expecto_counts = joined_df.iloc[:,53:]
recount2.shape, expecto_counts.shape

((24048, 53), (24048, 53))

In [40]:
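colnames = expecto_counts.columns.values.tolist()
# NOTE: .iloc[i] selects row i, so each line correlates one gene across the 53 tissues;
# a per-tissue correlation across genes needs .iloc[:, i] instead
for i in range(recount2.shape[1]):
    print(s.spearmanr(recount2.iloc[i].values, expecto_counts.iloc[i].values), colnames[i])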

SpearmanrResult(correlation=0.9838735687792292, pvalue=8.543761829803235e-40) Adipose_Subcutaneous
SpearmanrResult(correlation=0.9740364457345588, pvalue=1.4284066127785971e-34) Adipose_Visceral_Omentum
SpearmanrResult(correlation=0.8039025963554266, pvalue=4.245306429873534e-13) Adrenal_Gland
SpearmanrResult(correlation=0.945250766005483, pvalue=1.849377369033238e-26) Artery_Aorta
SpearmanrResult(correlation=0.9604902435091114, pvalue=5.4213608883148544e-30) Artery_Coronary
SpearmanrResult(correlation=0.9768585711981939, pvalue=7.854701103222134e-36) Artery_Tibial
SpearmanrResult(correlation=0.992823738106757, pvalue=1.0254111217968367e-48) Bladder_right
SpearmanrResult(correlation=0.966779551685212, pvalue=7.026477505939181e-32) Brain_Amygdala
SpearmanrResult(correlation=0.9634736332849541, pvalue=7.58857065169983e-31) Brain_Anterior_cingulate_cortex_BA24
SpearmanrResult(correlation=0.8879213030156425, pvalue=7.868755462656718e-19) Brain_Caudate_basal_ganglia
SpearmanrResult(correlat
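
For the per-tissue comparison described at the start of this section, each correlation should run over one column (a tissue) across all genes, which means `.iloc[:, i]` rather than `.iloc[i]`. A minimal sketch with toy data follows. The helper name `per_tissue_spearman` is hypothetical, and it assumes both frames share the same gene index and the same column order.

```python
import pandas as pd
from scipy import stats

def per_tissue_spearman(left, right):
    """Spearman correlation per tissue (column), computed across all genes (rows)."""
    if not left.index.equals(right.index):
        raise ValueError("both frames must share the same gene index")
    rows = []
    for i, tissue in enumerate(right.columns):
        # .iloc[:, i] selects column i, i.e. one tissue across all genes
        rho, pvalue = stats.spearmanr(left.iloc[:, i].values, right.iloc[:, i].values)
        rows.append({"tissue": tissue, "rho": rho, "pvalue": pvalue})
    return pd.DataFrame(rows).set_index("tissue")

# toy data (hypothetical): four genes, two tissues
genes = ["g1", "g2", "g3", "g4"]
left = pd.DataFrame({"Liver": [1.0, 2.0, 3.0, 4.0], "Lung": [4.0, 3.0, 2.0, 1.0]}, index=genes)
right = pd.DataFrame({"Liver": [0.1, 0.5, 2.0, 9.0], "Lung": [1.0, 2.0, 3.0, 4.0]}, index=genes)
result = per_tissue_spearman(left, right)
print(result)
```

Because Spearman correlation is rank-based, it is unaffected by monotone transformations such as a log, so the two tables do not need to be on the same scale for this comparison.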

In [29]:
recount2.iloc[:10,2], expecto_counts.iloc[:10,2]    

(ENSG00000000003    11.955509
 ENSG00000000005     6.086746
 ENSG00000000419    12.065735
 ENSG00000000457    11.510704
 ENSG00000000460    10.664779
 ENSG00000000938    11.174362
 ENSG00000000971    13.005532
 ENSG00000001036    12.812286
 ENSG00000001084    12.336624
 ENSG00000001167    11.664301
 Name: Adrenal Gland, dtype: float64, ENSG00000000003    11.349336
 ENSG00000000005     0.046674
 ENSG00000000419    29.331713
 ENSG00000000457     4.271980
 ENSG00000000460     0.658967
 ENSG00000000938     4.254194
 ENSG00000000971    20.765548
 ENSG00000001036    26.039038
 ENSG00000001084     8.453453
 ENSG00000001167     4.415666
 Name: Adrenal_Gland, dtype: float64)

#### 4. Compare Expecto and GTEx V6
--- not yet done

For this, we compute log(mean) across individuals for each GTEx tissue, in the same manner as for the recount2 data.

In [57]:
gtex_file = "/s/project/gtex-processed/gene_counts_v6/recount_mean_tissues.csv"
gtex = pd.read_csv(os.path.abspath(gtex_file), header=0, delimiter="\t", index_col=0)

In [58]:
# ignore the Ensembl ID version (everything after the first dot)
gtex.index = [item.split(".")[0] for item in gtex.index]
gtex.shape

(56318, 53)

In [59]:
joined_df = gtex.join(expecto_counts, how='inner', lsuffix='_left', rsuffix='_right')
joined_df.shape # total of 24048 genes could be mapped

(24048, 106)

In [62]:
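# copy the slice so the assignment below does not write into a view of joined_df
gtex = joined_df.iloc[:, :53].copy()
gtex[gtex < 0] = 0  # clamp negative values to zero
expecto_counts = joined_df.iloc[:, 53:]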
gtex.shape, expecto_counts.shape

((24048, 53), (24048, 53))

In [63]:
colnames = expecto_counts.columns.values.tolist()
# NOTE: as in section 3, .iloc[i] selects a gene row, not a tissue column
for i in range(gtex.shape[1]):
    print(s.spearmanr(gtex.iloc[i].values, expecto_counts.iloc[i].values), colnames[i])

SpearmanrResult(correlation=-0.12143202709240444, pvalue=0.3863969199686569) Adipose_Subcutaneous
SpearmanrResult(correlation=0.06297371391711014, pvalue=0.654175815544539) Adipose_Visceral_Omentum
SpearmanrResult(correlation=-0.09240445089501692, pvalue=0.5104843466093794) Adrenal_Gland
SpearmanrResult(correlation=0.450733752620545, pvalue=0.0007067593183282275) Artery_Aorta
SpearmanrResult(correlation=-0.18738913078535718, pvalue=0.17907111414844504) Artery_Coronary
SpearmanrResult(correlation=0.1569101757781003, pvalue=0.2618404134477018) Artery_Tibial
SpearmanrResult(correlation=0.17053701015965167, pvalue=0.2221273492483992) Bladder_right
SpearmanrResult(correlation=-0.14521851314304143, pvalue=0.2994983880203771) Brain_Amygdala
SpearmanrResult(correlation=0.033139816158684075, pvalue=0.8137638565846628) Brain_Anterior_cingulate_cortex_BA24
SpearmanrResult(correlation=-0.2845508788905015, pvalue=0.03891656299523723) Brain_Caudate_basal_ganglia
SpearmanrResult(correlation=0.3605870