# TCGA Breast Expression Data Importation
**Local Version**: 1
**Source Version**: NA

This notebook imports raw TCGA gene expression data through the [CGDS](http://www.cbioportal.org/cgds_r.jsp) portal for the study named "Breast Invasive Carcinoma (TCGA, Cell 2015)".

This study is preferred over "Breast Invasive Carcinoma (TCGA, Nature 2012)", even though it has slightly fewer samples, because it appears to be newer and includes more data types.

In [1]:
%run -m ipy_startup
%run -m ipy_logging
%matplotlib inline
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import api
from mgds.data_aggregation.import_lib import cgds
from mgds.data_aggregation.import_lib import tcga_breast
from py_utils.collection_utils import subset
pd.set_option('display.max_info_rows', 25000000)

In [2]:
case_list_id = tcga_breast.CASE_LIST_ID
genetic_profile_id = tcga_breast.PROF_GENE_EXPRESSION
batch_size = 50

op = lambda: cgds.get_genetic_profile_data(
    case_list_id, genetic_profile_id,
    api.get_hugo_gene_ids(), gene_id_batch_size=batch_size
)
d = db.cache_raw_operation(op, src.TCGA_BREAST_v1, 'gene-expression')

2016-11-24 08:28:14,930:DEBUG:mgds.data_aggregation.io_utils: Restoring serialized object from "/Users/eczech/data/research/mgds/raw/tcga-breast_v1_gene-expression.pkl"
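
The log line above indicates that the result was restored from a pickle file rather than downloaded again, which is why the download is wrapped in a lambda and passed in unevaluated. The following is a minimal sketch of that caching pattern, not the actual `db.cache_raw_operation` implementation: `cache_to_pickle`, `expensive_download`, and the temporary file path are hypothetical names used only for illustration.

```python
import os
import pickle
import tempfile


def cache_to_pickle(operation, path):
    """Hypothetical helper: run `operation` once, then reuse its pickled result."""
    if os.path.exists(path):
        # Cache hit: restore the serialized object instead of recomputing it
        with open(path, 'rb') as fd:
            return pickle.load(fd)
    # Cache miss: run the deferred operation and serialize its result
    result = operation()
    with open(path, 'wb') as fd:
        pickle.dump(result, fd)
    return result


calls = []


def expensive_download():
    # Stand-in for a slow remote query; records each invocation
    calls.append(1)
    return {'A1BG': 0.95, 'A2M': 0.24}


cache_path = os.path.join(tempfile.mkdtemp(), 'example_gene-expression.pkl')
first = cache_to_pickle(expensive_download, cache_path)   # runs the operation
second = cache_to_pickle(expensive_download, cache_path)  # restored from disk
```

Passing a callable instead of a computed value means the expensive call is skipped entirely whenever the cached file already exists.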


In [3]:
d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39032 entries, 0 to 7
Columns: 819 entries, GENE_ID to TCGA-BH-A1ES-06
dtypes: float64(589), int64(1), object(229)
memory usage: 244.2+ MB


In [4]:
d = cgds.melt_raw_data(d)
d.info()

[Remove null values for column "VALUE"] Records before = 31889144, Records after = 7193724, Records removed = 24695420 (%77.44)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7193724 entries, 117096 to 31850111
Data columns (total 4 columns):
GENE_ID:CGDS    7193724 non-null int64
GENE_ID:HGNC    7193724 non-null object
CELL_LINE_ID    7193724 non-null object
VALUE           7193724 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 274.4+ MB
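
The melt step turns the wide frame (one column per sample) into a long frame (one row per gene and sample) and then drops missing values. Below is a rough sketch of that reshaping on toy data using plain pandas. It is an assumption about what `cgds.melt_raw_data` does, not its actual code; the toy column names `GENE_ID` and `SYMBOL` and the sample names are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy wide frame: two identifier columns followed by one column per sample
wide = pd.DataFrame({
    'GENE_ID': [1, 2],
    'SYMBOL': ['A1BG', 'A2M'],
    'SAMPLE-1': [0.5, np.nan],
    'SAMPLE-2': [np.nan, -1.25],
    'SAMPLE-3': [2.0, np.nan],
})

# Reshape to long format: one row per (gene, sample) pair
long = pd.melt(
    wide,
    id_vars=['GENE_ID', 'SYMBOL'],
    var_name='CELL_LINE_ID',
    value_name='VALUE',
)
long = long.rename(columns={'GENE_ID': 'GENE_ID:CGDS', 'SYMBOL': 'GENE_ID:HGNC'})

# Drop rows with no measurement and report how many were removed
n_before = len(long)
long = long[long['VALUE'].notnull()]
n_after = len(long)
print('Records before = {}, after = {}, removed = {}'.format(
    n_before, n_after, n_before - n_after))
```

Filtering on `notnull()` keeps the original row labels of the melted frame, which would explain why the index in the output above no longer starts at 0.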


In [5]:
d_agg, d_dist = cgds.aggregate(d)
d_agg.head()

Unnamed: 0,CELL_LINE_ID,GENE_ID:HGNC,GENE_ID:CGDS,VALUE_CT,VALUE_STD,VALUE_MEAN
0,TCGA-A1-A0SD-01,A1BG,1,1,0.0,0.949333
1,TCGA-A1-A0SD-01,A2M,2,1,0.0,0.242
2,TCGA-A1-A0SD-01,A2ML1,144568,1,0.0,0.4235
3,TCGA-A1-A0SD-01,A3GALT2,127550,1,0.0,-0.126
4,TCGA-A1-A0SD-01,A4GALT,53947,1,0.0,1.128
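
The aggregated frame has one row per sample and gene with a replicate count, standard deviation, and mean. A minimal sketch of how such a summary can be produced with a pandas group-by follows. It is not the `cgds.aggregate` implementation: the toy data is invented, and filling the undefined standard deviation of single-value groups with 0 is an assumption made to mirror the `VALUE_STD` column shown above.

```python
import pandas as pd

# Toy long-format data; (S1, A1BG) is measured twice
long = pd.DataFrame({
    'CELL_LINE_ID': ['S1', 'S1', 'S1', 'S2'],
    'GENE_ID:HGNC': ['A1BG', 'A1BG', 'A2M', 'A1BG'],
    'GENE_ID:CGDS': [1, 1, 2, 1],
    'VALUE': [0.5, 0.5, -1.0, 2.0],
})

keys = ['CELL_LINE_ID', 'GENE_ID:HGNC', 'GENE_ID:CGDS']

# Collapse replicates into count, standard deviation, and mean per group
agg = long.groupby(keys)['VALUE'].agg(['count', 'std', 'mean']).reset_index()
agg = agg.rename(columns={
    'count': 'VALUE_CT', 'std': 'VALUE_STD', 'mean': 'VALUE_MEAN'})

# The sample standard deviation of a single value is NaN; use 0 instead (assumption)
agg['VALUE_STD'] = agg['VALUE_STD'].fillna(0.0)

# Distribution of replicate counts: how many groups had 1, 2, ... values
dist = agg['VALUE_CT'].value_counts().sort_index()
dist.name = 'Number of Replicates'
```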


In [6]:
d_agg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7181945 entries, 0 to 7181944
Data columns (total 6 columns):
CELL_LINE_ID    7181945 non-null object
GENE_ID:HGNC    7181945 non-null object
GENE_ID:CGDS    7181945 non-null int64
VALUE_CT        7181945 non-null int64
VALUE_STD       7181945 non-null float64
VALUE_MEAN      7181945 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 328.8+ MB


In [7]:
d_dist

1    7170587
2      10937
3        421
Name: Number of Replicates, dtype: int64
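
The counts reported in the outputs above can be cross-checked against each other. The short check below uses only numbers copied from those outputs. The one assumption is that 2 of the 819 raw columns are identifier columns rather than samples, which the raw `info()` output does not state directly.

```python
# Replicate distribution reported by d_dist
replicate_counts = {1: 7170587, 2: 10937, 3: 421}

# Each group becomes one aggregated row
n_groups = sum(replicate_counts.values())
assert n_groups == 7181945  # rows in d_agg

# Each group of k replicates accounts for k melted records
n_records = sum(k * v for k, v in replicate_counts.items())
assert n_records == 7193724  # non-null records after melting

# Raw frame shape: 39032 rows by 819 columns
n_rows, n_cols = 39032, 819
n_id_cols = 2  # assumption: two identifier columns, the rest are samples
assert n_rows * (n_cols - n_id_cols) == 31889144  # "Records before" in the melt log

# Share of melted records removed for having a null VALUE
n_removed = 31889144 - n_records
pct_removed = 100.0 * n_removed / 31889144
```

All three identities hold, so the aggregation neither dropped nor duplicated any non-null measurement.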

In [8]:
d_agg.describe()

Unnamed: 0,GENE_ID:CGDS,VALUE_CT,VALUE_STD,VALUE_MEAN
count,7181945.0,7181945.0,7181945.0,7181945.0
mean,255134.5,1.00164,0.0,0.009988264
std,4346312.0,0.04188835,0.0,1.341043
min,1.0,1.0,0.0,-11.785
25%,6494.0,1.0,0.0,-0.62475
50%,26235.0,1.0,0.0,0.026
75%,83902.0,1.0,0.0,0.6246667
max,102723500.0,3.0,0.0,14.207


In [9]:
assert np.all(pd.notnull(d_agg))
db.save(d_agg, src.TCGA_BREAST_v1, db.IMPORT, 'gene-expression')

'/Users/eczech/data/research/mgds/import/tcga-breast_v1_gene-expression.pkl'