# GTEx Tissue-Level RNA-seq Data Importation
**Local Version**: 1
**Source Version**: 6p

This notebook imports raw GTEx RNA-seq data from the [GTEx Data Portal](http://www.gtexportal.org/home/datasets).

Note that this data is not specific to individual samples or cell lines. Instead, it is an aggregate, tissue-specific measure of expression levels for each gene (the source file downloaded below reports the median RPKM per gene and tissue).
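
As a rough illustration of what "aggregate" means here, the sketch below collapses sample-level values into one median per gene and tissue. The sample table, gene labels, tissue labels, and values are all made up for illustration; GTEx publishes the aggregated file directly, so this notebook never performs this step itself.

```python
import pandas as pd

# Hypothetical sample-level measurements (values invented for illustration)
samples = pd.DataFrame({
    "GENE": ["G1"] * 5 + ["G2"] * 5,
    "TISSUE": ["Liver", "Liver", "Liver", "Lung", "Lung"] * 2,
    "RPKM": [1.0, 3.0, 2.0, 10.0, 20.0, 0.0, 0.0, 5.0, 4.0, 6.0],
})

# One median per (gene, tissue) pair, laid out with tissues as columns,
# which mirrors the shape of the tissue-level file imported below
agg = samples.groupby(["GENE", "TISSUE"])["RPKM"].median().unstack("TISSUE")
print(agg)
```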

In [12]:
%run -m ipy_startup
%run -m ipy_logging
%matplotlib inline
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import api
from mgds.data_aggregation.import_lib import gtex
from mgds.data_aggregation import io_utils
from py_utils.collection_utils import subset

In [15]:
filepath = db.raw_file(src.GTEX_v1, 'gene-agg-rna-seq.gz')
url = 'http://www.gtexportal.org/static/datasets/gtex_analysis_v6p/rna_seq_data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_median_rpkm.gct.gz'
filepath = io_utils.download(url, filepath)
filepath

2016-11-28 07:46:47,302:DEBUG:mgds.data_aggregation.io_utils: Returning previously downloaded path for "/Users/eczech/data/research/mgds/raw/gtex_v1_agg-rna-seq.gz"


'/Users/eczech/data/research/mgds/raw/gtex_v1_agg-rna-seq.gz'

In [16]:
d = pd.read_csv(filepath, sep='\t', skiprows=[0,1])
d = d.rename(columns={'Name': 'GENE_ID:ENSEMBL', 'Description': 'GENE_ID:HGNC'})
d.head()

Unnamed: 0,GENE_ID:ENSEMBL,GENE_ID:HGNC,Adipose - Subcutaneous,Adipose - Visceral (Omentum),Adrenal Gland,Artery - Aorta,Artery - Coronary,Artery - Tibial,Bladder,Brain - Amygdala,...,Skin - Not Sun Exposed (Suprapubic),Skin - Sun Exposed (Lower leg),Small Intestine - Terminal Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole Blood
0,ENSG00000223972.4,DDX11L1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01884,0.0,0.8229,0.0,0.0,0.0,0.0615
1,ENSG00000227232.4,WASH7P,8.294,7.283,6.109,7.445,7.85,7.266,10.48,4.962,...,13.6,13.66,10.6,13.47,8.051,12.54,12.55,13.01,11.36,7.572
2,ENSG00000243485.2,MIR1302-11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.1141,0.0,0.0,0.0,0.0
3,ENSG00000237613.2,FAM138A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,ENSG00000268020.2,OR4G4P,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
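
The `skiprows=[0,1]` argument above is needed because a GCT file starts with two header lines: a version line (`#1.2`) and a line giving the number of data rows and data columns. The column header (`Name`, `Description`, then one column per tissue) follows on the third line. A minimal sketch of this on a toy in-memory file, where the gene IDs, tissue names, and values are invented for illustration:

```python
import io
import pandas as pd

# Toy GCT content; IDs and values are made up, only the layout matters
gct_text = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tTissue A\tTissue B\tTissue C\n"
    "ENSG_TOY_1.1\tGENE1\t0.0\t1.5\t2.0\n"
    "ENSG_TOY_2.1\tGENE2\t3.0\t0.0\t0.5\n"
)

# The second line declares the expected dimensions of the data matrix
n_rows, n_cols = (int(v) for v in gct_text.splitlines()[1].split("\t"))

# Skip the version and dimension lines, as in the cell above
toy = pd.read_csv(io.StringIO(gct_text), sep="\t", skiprows=[0, 1])
toy = toy.rename(columns={"Name": "GENE_ID:ENSEMBL", "Description": "GENE_ID:HGNC"})

# The parsed frame should match the declared dimensions (plus the two id columns)
assert toy.shape == (n_rows, n_cols + 2)
```

Checking the parsed shape against the declared dimensions is an optional sanity check that is not part of the original import; it can catch a truncated download early.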


In [25]:
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56238 entries, 0 to 56237
Data columns (total 55 columns):
GENE_ID:ENSEMBL                              56238 non-null object
GENE_ID:HGNC                                 56238 non-null object
Adipose - Subcutaneous                       56238 non-null float64
Adipose - Visceral (Omentum)                 56238 non-null float64
Adrenal Gland                                56238 non-null float64
Artery - Aorta                               56238 non-null float64
Artery - Coronary                            56238 non-null float64
Artery - Tibial                              56238 non-null float64
Bladder                                      56238 non-null float64
Brain - Amygdala                             56238 non-null float64
Brain - Anterior cingulate cortex (BA24)     56238 non-null float64
Brain - Caudate (basal ganglia)              56238 non-null float64
Brain - Cerebellar Hemisphere                56238 non-null float64
Brain - C

In [23]:
# At time of writing, HGNC ids were duplicated across records but Ensembl ids were not; ensure that is still true
assert not np.any(d['GENE_ID:ENSEMBL'].duplicated())
assert np.any(d['GENE_ID:HGNC'].duplicated())

In [24]:
d_tr = pd.melt(d, id_vars=['GENE_ID:ENSEMBL', 'GENE_ID:HGNC'], var_name='TISSUE_TYPE', value_name='VALUE')
d_tr.head()

Unnamed: 0,GENE_ID:ENSEMBL,GENE_ID:HGNC,TISSUE_TYPE,VALUE
0,ENSG00000223972.4,DDX11L1,Adipose - Subcutaneous,0.0
1,ENSG00000227232.4,WASH7P,Adipose - Subcutaneous,8.294
2,ENSG00000243485.2,MIR1302-11,Adipose - Subcutaneous,0.0
3,ENSG00000237613.2,FAM138A,Adipose - Subcutaneous,0.0
4,ENSG00000268020.2,OR4G4P,Adipose - Subcutaneous,0.0
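
The melt above turns one row per gene into one row per gene and tissue, so the long table should hold exactly (number of genes) x (number of tissues) rows. A small sketch on toy data, with invented ids and values, that checks this and confirms the reshape is lossless by pivoting back:

```python
import pandas as pd

# Toy wide table in the same layout as `d` (values are made up)
wide = pd.DataFrame({
    "GENE_ID:ENSEMBL": ["ENSG_TOY_1.1", "ENSG_TOY_2.1"],
    "GENE_ID:HGNC": ["GENE1", "GENE2"],
    "Tissue A": [0.0, 3.0],
    "Tissue B": [1.5, 0.0],
    "Tissue C": [2.0, 0.5],
})
id_cols = ["GENE_ID:ENSEMBL", "GENE_ID:HGNC"]
tissue_cols = [c for c in wide.columns if c not in id_cols]

long = pd.melt(wide, id_vars=id_cols, var_name="TISSUE_TYPE", value_name="VALUE")

# Every tissue contributes one row per gene
assert len(long) == len(wide) * len(tissue_cols)
assert (long["TISSUE_TYPE"].value_counts() == len(wide)).all()

# Pivoting on the unique Ensembl id recovers the original matrix
restored = long.pivot(index="GENE_ID:ENSEMBL", columns="TISSUE_TYPE", values="VALUE")
restored = restored.loc[wide["GENE_ID:ENSEMBL"], tissue_cols]
assert (restored.to_numpy() == wide[tissue_cols].to_numpy()).all()
```

The pivot uses the Ensembl id as its index because HGNC symbols are not unique in this dataset.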


In [27]:
d_tr['TISSUE_TYPE'].value_counts().sort_index()

Adipose - Subcutaneous                       56238
Adipose - Visceral (Omentum)                 56238
Adrenal Gland                                56238
Artery - Aorta                               56238
Artery - Coronary                            56238
Artery - Tibial                              56238
Bladder                                      56238
Brain - Amygdala                             56238
Brain - Anterior cingulate cortex (BA24)     56238
Brain - Caudate (basal ganglia)              56238
Brain - Cerebellar Hemisphere                56238
Brain - Cerebellum                           56238
Brain - Cortex                               56238
Brain - Frontal Cortex (BA9)                 56238
Brain - Hippocampus                          56238
Brain - Hypothalamus                         56238
Brain - Nucleus accumbens (basal ganglia)    56238
Brain - Putamen (basal ganglia)              56238
Brain - Spinal cord (cervical c-1)           56238
Brain - Substantia nigra       

## Export

In [28]:
assert np.all(pd.notnull(d_tr))
db.save(d_tr, src.GTEX_v1, db.IMPORT, 'gene-agg-rna-seq')

'/Users/eczech/data/research/mgds/import/gtex_v1_gene-agg-rna-seq.pkl'
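
The export is written as a pickle file. The internals of `db.save` are not shown in this notebook, so the sketch below only illustrates the general pattern with plain pandas calls: check for nulls, write a pickle, and read it back. The toy frame and the file name are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy long-format table standing in for `d_tr` (values are made up)
toy_long = pd.DataFrame({
    "GENE_ID:ENSEMBL": ["ENSG_TOY_1.1", "ENSG_TOY_2.1"],
    "GENE_ID:HGNC": ["GENE1", "GENE2"],
    "TISSUE_TYPE": ["Tissue A", "Tissue A"],
    "VALUE": [0.0, 3.0],
})

# Same intent as the null check above: no missing value in any column
assert toy_long.notnull().all().all()

# Round-trip through a pickle in a temporary directory (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-gene-agg-rna-seq.pkl")
    toy_long.to_pickle(path)
    reloaded = pd.read_pickle(path)

assert reloaded.equals(toy_long)
```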