# GTEx RNA-seq Data Importation
**Local Version**: 1
**Source Version**: 6p

This notebook imports per-sample, raw GTEx RNA-seq data from the [GTEx Data Portal](http://www.gtexportal.org/home/datasets).

Note that these values are in units of RPKM (reads per kilobase of transcript per million mapped reads). RPKM is one of several conventions for quantifying expression from RNA-seq; FPKM, the unit used in the NCI-DREAM data, is another.

See [here](http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/) for a discussion of some of the differences.
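
As a rough illustration of how these units relate, the sketch below rescales the RPKM values of a single sample to TPM by dividing each value by the sample total and multiplying by one million. The function name `rpkm_to_tpm` and the toy values are assumptions made for illustration; nothing in this notebook performs this conversion.

```python
def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million (TPM)."""
    total = sum(rpkm)
    if total == 0:
        raise ValueError('cannot rescale a sample whose RPKM values sum to zero')
    return [value / total * 1e6 for value in rpkm]

# Toy RPKM values for three genes in one hypothetical sample
sample_rpkm = [10.0, 30.0, 60.0]
sample_tpm = rpkm_to_tpm(sample_rpkm)
print(sample_tpm)
```

Because the rescaling is applied per sample, TPM values always sum to the same total in every sample, which is not true of RPKM or FPKM.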

In [1]:
%run -m ipy_startup
%run -m ipy_logging
%matplotlib inline
%load_ext Cython
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import api
from mgds.data_aggregation.import_lib import gtex
from mgds.data_aggregation import io_utils
from py_utils.collection_utils import subset

In [3]:
filepath = db.raw_file(src.GTEX_v1, 'gene-rna-seq.gz')
url = 'http://www.gtexportal.org/static/datasets/gtex_analysis_v6p/rna_seq_data/GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz'
filepath = io_utils.download(url, filepath)
filepath

2016-11-28 11:01:53,897:DEBUG:mgds.data_aggregation.io_utils: Returning previously downloaded path for "/Users/eczech/data/research/mgds/raw/gtex_v1_rna-seq.gz"


'/Users/eczech/data/research/mgds/raw/gtex_v1_rna-seq.gz'

In [4]:
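def melt_df(d):
    # Reshape one wide chunk (genes x samples) into long format, one row per gene/sample pair
    d = d.rename(columns={'Name': 'GENE_ID:ENSEMBL', 'Description': 'GENE_ID:HGNC'})
    d = pd.melt(d, id_vars=['GENE_ID:ENSEMBL', 'GENE_ID:HGNC'], var_name='CELL_LINE_ID:GTEX', value_name='VALUE')
    # Zeros were parsed as NaN (see na_values below), so this keeps only nonzero values
    return d[d['VALUE'].notnull()]

d_part = []
n_lines = 56241 # gzip -dc /Users/eczech/data/research/mgds/raw/gtex_v1_rna-seq.gz | wc -l
n_chunk = 25
chunk_size = int(n_lines / float(n_chunk))
# skiprows drops the two preamble lines; na_values=['0'] turns zeros into NaN to keep the result sparse
# The truncated chunk size leaves a short remainder chunk, hence n_chunk + 1 in the progress message
for i, df in enumerate(pd.read_csv(filepath, sep='\t', skiprows=[0,1], na_values=['0'], chunksize=chunk_size)):
    print('Processing chunk {} of {}'.format(i + 1, n_chunk + 1))
    d_part.append(melt_df(df))
print('Done')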

Processing chunk 1 of 25
Processing chunk 2 of 25
Processing chunk 3 of 25
Processing chunk 4 of 25
Processing chunk 5 of 25
Processing chunk 6 of 25
Processing chunk 7 of 25
Processing chunk 8 of 25
Processing chunk 9 of 25
Processing chunk 10 of 25
Processing chunk 11 of 25
Processing chunk 12 of 25
Processing chunk 13 of 25
Processing chunk 14 of 25
Processing chunk 15 of 25
Processing chunk 16 of 25
Processing chunk 17 of 25
Processing chunk 18 of 25
Processing chunk 19 of 25
Processing chunk 20 of 25
Processing chunk 21 of 25
Processing chunk 22 of 25
Processing chunk 23 of 25
Processing chunk 24 of 25
Processing chunk 25 of 25
Processing chunk 26 of 25
Done
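
The progress output ends at a 26th chunk even though `n_chunk` is 25. A minimal sketch of the arithmetic, assuming the 56241 lines consist of the two skipped preamble lines, one column header line, and data rows for the remainder (the names `n_rows`, `n_chunks_read`, and `last_chunk_rows` are introduced here for illustration):

```python
n_lines = 56241   # total lines in the decompressed file, from `wc -l`
n_chunk = 25      # intended number of chunks
chunk_size = int(n_lines / float(n_chunk))

# Two skipped preamble lines plus one header line are not data rows
n_rows = n_lines - 3

# Ceiling division: full chunks plus one shorter remainder chunk, if any
n_chunks_read = (n_rows + chunk_size - 1) // chunk_size
last_chunk_rows = n_rows - (n_chunks_read - 1) * chunk_size

print(chunk_size, n_rows, n_chunks_read, last_chunk_rows)
# prints: 2249 56238 26 13
```

Truncating the chunk size means 25 chunks cover only 56225 rows, so the remaining 13 rows arrive as an extra chunk.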


In [6]:
d = pd.concat(d_part)
d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233807970 entries, 1 to 111212
Data columns (total 4 columns):
GENE_ID:ENSEMBL      object
GENE_ID:HGNC         object
CELL_LINE_ID:GTEX    object
VALUE                float64
dtypes: float64(1), object(3)
memory usage: 8.7+ GB


In [8]:
(d['VALUE'] == 0.).value_counts()

False    233807970
Name: VALUE, dtype: int64
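
No zeros remain because the import reads them as missing values and `melt_df` then drops the missing rows. The sketch below reproduces that behavior on a made-up two-gene, two-sample table; the gene names, sample names, and values are invented for illustration, and the column names are simplified relative to the notebook's.

```python
import io
import pandas as pd

# Toy tab-separated table laid out like the body of the GCT file (made-up values)
toy = io.StringIO(
    'Name\tDescription\tS1\tS2\n'
    'ENSG_A\tGENE_A\t0\t1.5\n'
    'ENSG_B\tGENE_B\t2.0\t0\n'
)

# na_values=['0'] makes the parser treat zeros as NaN
wide = pd.read_csv(toy, sep='\t', na_values=['0'])

# Melt to one row per gene/sample pair, then drop the NaN (originally zero) rows
tidy = pd.melt(wide, id_vars=['Name', 'Description'], var_name='SAMPLE', value_name='VALUE')
tidy = tidy[tidy['VALUE'].notnull()]
print(tidy)
```

Of the four gene/sample pairs in the toy table, only the two nonzero ones survive, which is the same effect that reduces the full dataset to its nonzero entries.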

In [9]:
assert np.all(d.notnull())

In [11]:
d[['GENE_ID:ENSEMBL', 'GENE_ID:HGNC', 'CELL_LINE_ID:GTEX']].duplicated().value_counts()

False    233807970
dtype: int64

## Export

In [None]:
# :TODO:
# Not sure what to do with this yet; it's too big to use directly
# It may make sense to create per-gene, per-tissue-type distributions and use those (as priors) instead of the raw data

# assert np.all(pd.notnull(d))
# db.save(d, src.GTEX_v1, db.IMPORT, 'gene-rna-seq')
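
One possible shape for the per-gene, per-tissue summary mentioned in the TODO is sketched below on made-up data. It is an assumption, not a decided approach: the `tissue` lookup is hypothetical (this notebook does not load any sample-to-tissue mapping), and the choice of count, mean, and standard deviation as summary statistics is purely illustrative.

```python
import pandas as pd

# Made-up long-format values using the same column names as `d`
toy = pd.DataFrame({
    'GENE_ID:HGNC': ['G1', 'G1', 'G1', 'G2', 'G2'],
    'CELL_LINE_ID:GTEX': ['S1', 'S2', 'S3', 'S1', 'S3'],
    'VALUE': [1.0, 3.0, 5.0, 2.0, 4.0],
})

# Hypothetical sample-to-tissue lookup; not provided anywhere in this notebook
tissue = {'S1': 'Lung', 'S2': 'Lung', 'S3': 'Liver'}
toy['TISSUE'] = toy['CELL_LINE_ID:GTEX'].map(tissue)

# One row per gene and tissue, summarizing the values observed for that pair
summary = toy.groupby(['GENE_ID:HGNC', 'TISSUE'])['VALUE'].agg(['count', 'mean', 'std'])
print(summary)
```

Since zeros were dropped during import, any summary built this way describes only the nonzero values; the number of samples per tissue would be needed separately to account for the zeros.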