# GTEx V7 RNA-Seq data processing
This notebook converts GTEx RNA-Seq data into tissue-specific tables and saves them in HDF5 format, which can be read from R via the `rhdf5` library.

**update**: this will later become part of the GTEx eQTL pipeline.

## Data conversion

In [1]:
import pandas as pd
import numpy as np
import re, os
def load_data(fdata, fsample, dtype = np.float32):
    '''First column of the expression data (Name) is the Ensembl gene ID; the second (Description) is the HUGO gene symbol'''
    head = pd.read_csv(fdata, skiprows = 2, sep = '\t', nrows = 1)
    dt = {'Description': str, 'Name': str}
    dt.update({x: dtype for x in head.columns if x not in dt})
    # Gene IDs become the index; the gene symbol column is not needed downstream
    data = pd.read_csv(fdata, compression='gzip', skiprows=2,
                       index_col=0, header=0, dtype = dt, sep='\t').drop(columns='Description')
    samples = pd.read_csv(fsample, dtype=str, delimiter='\t', header=0)
    sample_dict = {}
    for row in samples[['SAMPID', 'SMTSD']].values:
        if row[1] not in sample_dict:
            sample_dict[row[1]] = []
        if row[0] in data.columns:
            sample_dict[row[1]].append(row[0])
    # Turn tissue names into valid keys and drop tissues without any sample in the expression table
    return data, dict((re.sub(r"[\W\d]+", "_", k.strip()).strip('_'), v) for k, v in sample_dict.items() if len(v))
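
The sample grouping and key clean-up at the end of `load_data` can be tried in isolation. The following is a minimal sketch using only the standard library; the helper names `sanitize` and `group_samples` and the toy sample IDs are made up for illustration and are not part of the notebook.

```python
import re

def sanitize(name):
    # Collapse every run of non-word characters or digits into "_", then trim "_"
    return re.sub(r"[\W\d]+", "_", name.strip()).strip("_")

def group_samples(rows, available):
    """rows: (SAMPID, SMTSD) pairs; available: sample IDs present in the expression table."""
    groups = {}
    for sample_id, tissue in rows:
        if sample_id in available:
            groups.setdefault(tissue, []).append(sample_id)
    return {sanitize(k): v for k, v in groups.items()}

# Toy annotation rows (hypothetical sample IDs, GTEx-style tissue names)
rows = [
    ("S1", "Adipose - Subcutaneous"),
    ("S2", "Adipose - Subcutaneous"),
    ("S3", "Brain - Anterior cingulate cortex (BA24)"),
    ("S4", "Lung"),
]
available = {"S1", "S2", "S3"}  # S4 has no expression column, so "Lung" is dropped

groups = group_samples(rows, available)
print(groups)
# {'Adipose_Subcutaneous': ['S1', 'S2'], 'Brain_Anterior_cingulate_cortex_BA': ['S3']}
```

Because digits are stripped along with punctuation, two tissue names that differ only in digits would map to the same key, so it may be worth checking that the cleaned keys are unique.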

In [2]:
fdata = 'data1/rna-seq/GTEx_Data_2016-01-15_v7_RNA-seq_RNA-SeQCv1.1.8_gene_reads.gct.gz'
fsample = 'data1/sample_annotations/GTEx_Analysis_2016-01-15_v7_SampleAttributesDS.txt'
# NOTE: the input file holds read counts (gene_reads), although the output file name says "rpkm"
fout = 'data1/rna-seq/GTEx_v7_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.hdf5'
data, sample = load_data(fdata, fsample, dtype = np.uint32)
data = {k: data.loc[:, sample[k]] for k in sample}

In [3]:
if os.path.isfile(fout):
    os.remove(fout)
for k in data:
    # One HDF5 group per tissue, appended to the same file
    data[k].to_hdf(fout, key = k, mode = 'a', complevel = 9, complib = 'zlib')

## Load / use data in R

In [4]:
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install("rhdf5")
library(rhdf5)
fdata = 'data1/rna-seq/GTEx_v7_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.hdf5'
meta = h5ls(fdata)
groups = unique(meta$group)
groups = groups[which(groups != '/')]
groups

In [6]:
mydata <- h5read(fdata, "/Lung")
str(mydata)

List of 4
 $ axis0        : chr [1:474(1d)] "GTEX-111CU-0326-SM-5GZXO" "GTEX-111FC-1126-SM-5GZWU" "GTEX-111VG-0726-SM-5GIDC" "GTEX-111YS-0626-SM-5GZXV" ...
 $ axis1        : chr [1:56202(1d)] "ENSG00000223972.4" "ENSG00000227232.4" "ENSG00000243485.2" "ENSG00000237613.2" ...
 $ block0_items : chr [1:474(1d)] "GTEX-111CU-0326-SM-5GZXO" "GTEX-111FC-1126-SM-5GZWU" "GTEX-111VG-0726-SM-5GIDC" "GTEX-111YS-0626-SM-5GZXV" ...
 $ block0_values: int [1:474, 1:56202] 4 5 4 1 5 5 1 3 0 1 ...


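So the tissue-specific object `mydata` is a list with elements `axis0` (sample names), `axis1` (gene names), and `block0_values`, the 474 × 56202 (samples × genes) data matrix; `block0_items` holds the column labels of that block, here the same sample names as `axis0`. To obtain a genes × samples matrix, take `t(mydata$block0_values)` and set its rownames to `mydata$axis1` and its colnames to `mydata$axis0`.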
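
The orientation described above can be illustrated on a toy example. This is a plain-Python sketch of the layout only, not of the `rhdf5` API; the sample and gene names and the values are invented, and the variable names simply mirror the list elements shown in the `str(mydata)` output.

```python
# Toy stand-ins for the elements of `mydata` (all names and values are hypothetical)
axis0 = ["sample_A", "sample_B"]            # sample names
axis1 = ["gene_1", "gene_2", "gene_3"]      # gene names
block0_values = [[4, 0, 7],                 # samples x genes, as reported by R
                 [5, 1, 2]]

# Transpose to genes x samples (the counterpart of t(block0_values) in R)
genes_by_samples = [list(col) for col in zip(*block0_values)]

# Attach labels: rows are keyed by gene, columns by sample
table = {gene: dict(zip(axis0, row)) for gene, row in zip(axis1, genes_by_samples)}

print(table["gene_3"]["sample_B"])  # 2
```

After the transpose, each row corresponds to one gene and each column to one sample, which matches the layout of the original expression table.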