# TCGA RNA-Seq Data Importation
**Local Version**: 1
**Source Version**: NA

This notebook imports normalized TCGA RNA-Seq v2 data for the breast cancer (`brca`) cohort through the [CGDS](http://www.cbioportal.org/cgds_r.jsp) portal.

RNA-Seq v2 data is available both as raw values and as z-scores, but only the normalized z-scores are imported here.
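As a rough illustration of how z-scores relate to raw expression values, the sketch below standardizes each gene across samples with pandas. It assumes a small hypothetical genes-by-samples frame and uses all samples as the reference population. How the portal picks its reference samples, and whether it transforms the values first, is not described in this notebook, so this should not be read as a reproduction of the imported z-score profile.

```python
import pandas as pd


def zscore_by_gene(raw):
    """Standardize each gene (row) across samples (columns).

    Assumption: every sample contributes to the reference mean and
    standard deviation; the portal may use a different reference set.
    """
    mean = raw.mean(axis=1)
    std = raw.std(axis=1, ddof=0)  # population standard deviation
    return raw.sub(mean, axis=0).div(std, axis=0)


# Hypothetical raw values: two genes measured in three samples
raw = pd.DataFrame(
    {'S1': [10.0, 5.0], 'S2': [20.0, 5.5], 'S3': [30.0, 6.0]},
    index=['GENE_A', 'GENE_B'],
)
z = zscore_by_gene(raw)
print(z.round(3))
```

After standardization each gene has mean 0 and unit variance across samples, which makes values comparable between genes with very different expression levels.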

In [1]:
%run -m ipy_startup
%run -m ipy_logging false
%matplotlib inline
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import data_type as dtyp
from mgds.data_aggregation import api
from mgds.data_aggregation.import_lib import cgds
from mgds.data_aggregation.import_lib import tcga
from py_utils.collection_utils import subset
pd.set_option('display.max_info_rows', 25000000)

In [2]:
tables = tcga.import_genetic_profile_data(
    profile_fmt=tcga.PROF_FMT_RNASEQ_ZSCORE,
    data_type=dtyp.add_normalized_modifier(dtyp.GENE_RNA_SEQ),
    gene_ids=api.get_hugo_gene_ids(),
    cohorts=['brca']
)

2016-12-20 09:08:06,116:INFO:mgds.data_aggregation.import_lib.tcga: Importing data for study "brca_tcga" (3 of 32), cohort "brca", case list "brca_tcga_all", profile "brca_tcga_rna_seq_v2_mrna_median_Zscores", table "brca-gene-rna-seq-normalized"
2016-12-20 09:08:06,118:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 1 of 789
2016-12-20 09:11:46,386:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 79 of 789
2016-12-20 09:15:12,652:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 157 of 789
2016-12-20 09:18:26,914:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 235 of 789
2016-12-20 09:21:43,546:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 313 of 789
2016-12-20 09:24:39,138:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 391 of 789
2016-12-20 09:28:00,319:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 469 of 789
2016-12-20 09:31:15,719:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch
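
The log above shows the gene list being requested in 789 batches. A minimal sketch of that kind of batching follows. The `iter_batches` helper and the batch size of 50 are hypothetical; the actual batch size and request logic inside `cgds` are not shown in this notebook.

```python
import math


def iter_batches(items, batch_size):
    """Yield (batch_number, total_batches, batch) for consecutive slices."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    n_batches = math.ceil(len(items) / batch_size)
    for i in range(n_batches):
        start = i * batch_size
        yield i + 1, n_batches, items[start:start + batch_size]


# Hypothetical gene identifiers standing in for the real HUGO gene list
gene_ids = ['GENE_%d' % i for i in range(120)]
batches = list(iter_batches(gene_ids, 50))

for number, total, batch in batches:
    print('Processing batch %d of %d (%d genes)' % (number, total, len(batch)))
```

Requesting genes in fixed-size groups keeps each request small and allows progress to be logged at regular intervals, as in the output above.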

In [4]:
d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39032 entries, 0 to 7
Columns: 819 entries, GENE_ID to TCGA-BH-A1ES-06
dtypes: float64(817), int64(1), object(1)
memory usage: 244.2+ MB


In [5]:
d = cgds.melt_raw_data(d)
d.info()

[Remove null values for column "VALUE"] Records before = 31889144, Records after = 18063792, Records removed = 13825352 (%43.35)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18063792 entries, 0 to 31850111
Data columns (total 4 columns):
GENE_ID:CGDS    18063792 non-null int64
GENE_ID:HGNC    18063792 non-null object
CELL_LINE_ID    18063792 non-null object
VALUE           18063792 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 689.1+ MB
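
The wide-to-long reshaping performed by `cgds.melt_raw_data` can be approximated with plain pandas. This is a sketch rather than the library's implementation: it assumes the raw frame has one row per gene, with `GENE_ID` and `COMMON` identifier columns followed by one column per sample. The `melt_wide` helper name, the `COMMON` column name, and the sample names are hypothetical.

```python
import numpy as np
import pandas as pd


def melt_wide(wide, id_cols=('GENE_ID', 'COMMON')):
    """Reshape a genes-by-samples frame to long format and drop null values.

    Returns the long frame and the number of null records removed.
    """
    long_df = wide.melt(
        id_vars=list(id_cols), var_name='CELL_LINE_ID', value_name='VALUE'
    )
    n_before = len(long_df)
    long_df = long_df.dropna(subset=['VALUE'])
    # Rename identifier columns to the names seen in the melted output above
    long_df = long_df.rename(
        columns={'GENE_ID': 'GENE_ID:CGDS', 'COMMON': 'GENE_ID:HGNC'}
    )
    return long_df, n_before - len(long_df)


# Hypothetical wide frame: two genes, two samples, one missing value
wide = pd.DataFrame({
    'GENE_ID': [1, 2],
    'COMMON': ['A1BG', 'A2M'],
    'SAMPLE-1': [0.5, -0.2],
    'SAMPLE-2': [np.nan, 0.1],
})
long_df, n_removed = melt_wide(wide)
print(long_df)
print('Records removed =', n_removed)
```

A melt of this shape explains the record counts in the log: 39,032 rows times 817 sample columns gives 31,889,144 records before null removal.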


In [6]:
d_agg, d_dist = cgds.aggregate(d)
d_agg.head()

|   | CELL_LINE_ID | GENE_ID:HGNC | GENE_ID:CGDS | VALUE_CT | VALUE_MEAN | VALUE_STD |
|---|---|---|---|---|---|---|
| 0 | TCGA-A1-A0SB-01 | A1BG | 1 | 1 | 0.005 | 0.0 |
| 1 | TCGA-A1-A0SB-01 | A1CF | 29974 | 1 | -0.001 | 0.0 |
| 2 | TCGA-A1-A0SB-01 | A2M | 2 | 1 | -0.002 | 0.0 |
| 3 | TCGA-A1-A0SB-01 | A2ML1 | 144568 | 1 | -0.002 | 0.0 |
| 4 | TCGA-A1-A0SB-01 | A2MP1 | 3 | 1 | -0.002 | 0.0 |


In [7]:
d_dist

1    18007488
2       24480
3        2448
Name: Number of Replicates, dtype: int64
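
The aggregation step collapses replicate measurements of the same gene in the same sample into a count, mean, and standard deviation, and `d_dist` reports how many sample/gene pairs had 1, 2, or 3 replicates. A sketch of the same idea in plain pandas follows. The `aggregate_replicates` helper is hypothetical, and using a population standard deviation (`ddof=0`) is an assumption made so that single-replicate rows get 0.0 instead of NaN; how `cgds.aggregate` arrives at its zeros is not shown here.

```python
import pandas as pd


def aggregate_replicates(long_df):
    """Summarize replicate values per sample/gene pair.

    Returns the aggregated frame and the distribution of replicate counts.
    """
    keys = ['CELL_LINE_ID', 'GENE_ID:HGNC', 'GENE_ID:CGDS']
    grouped = long_df.groupby(keys)['VALUE']
    agg = pd.DataFrame({
        'VALUE_CT': grouped.count(),
        'VALUE_MEAN': grouped.mean(),
        'VALUE_STD': grouped.std(ddof=0),  # assumption: population std
    }).reset_index()
    dist = agg['VALUE_CT'].value_counts().sort_index()
    dist.name = 'Number of Replicates'
    return agg, dist


# Hypothetical long-format data with one duplicated sample/gene pair
long_df = pd.DataFrame({
    'CELL_LINE_ID': ['S1', 'S1', 'S1', 'S2'],
    'GENE_ID:HGNC': ['A1BG', 'A1BG', 'A2M', 'A1BG'],
    'GENE_ID:CGDS': [1, 1, 2, 1],
    'VALUE': [0.1, 0.3, -0.2, 0.0],
})
agg, dist = aggregate_replicates(long_df)
print(agg)
print(dist)
```

The distribution above is consistent with the melted record count: 18,007,488 + 2 × 24,480 + 3 × 2,448 = 18,063,792 records.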

In [8]:
d_agg.describe()

|   | GENE_ID:CGDS | VALUE_CT | VALUE_MEAN | VALUE_STD |
|---|---|---|---|---|
| count | 18034420.0 | 18034420.0 | 18034420.0 | 18034416.0 |
| mean | 8703688.0 | 1.001629 | 0.03469139 | 0.0 |
| std | 28103300.0 | 0.04356275 | 0.4305323 | 0.0 |
| min | 1.0 | 1.0 | -1.293 | 0.0 |
| 25% | 8427.0 | 1.0 | -0.068 | 0.0 |
| 50% | 55699.0 | 1.0 | 0.0 | 0.0 |
| 75% | 197259.0 | 1.0 | 0.062 | 0.0 |
| max | 105371600.0 | 3.0 | 3.657 | 0.0 |


In [10]:
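# Fail fast if any aggregated field is null before persisting
assert np.all(pd.notnull(d_agg))
# Persist the aggregated result; note that the table name used here is
# 'gene-copy-number', although the data are RNA-Seq z-scores
db.save(d_agg, src.TCGA_BREAST_v1, db.IMPORT, 'gene-copy-number')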

'/Users/eczech/data/research/mgds/import/tcga-breast_v1_gene-copy-number.pkl'