# TCGA Breast Copy Number Data Importation
**Local Version**: 1
**Source Version**: NA

This notebook will import raw TCGA copy number data through the [CGDS](http://www.cbioportal.org/cgds_r.jsp) portal for the study named "Breast Invasive Carcinoma (TCGA, Cell 2015)".

This study is preferred over "Breast Invasive Carcinoma (TCGA, Nature 2012)" even though it has slightly fewer samples, because it appears to be newer and includes more data types.

In [2]:
%run -m ipy_startup
%run -m ipy_logging
%matplotlib inline
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import api
from mgds.data_aggregation.import_lib import cgds
from mgds.data_aggregation.import_lib import tcga_breast
from py_utils.collection_utils import subset
pd.set_option('display.max_info_rows', 25000000)

In [3]:
case_list_id = tcga_breast.CASE_LIST_ID
genetic_profile_id = tcga_breast.PROF_COPY_NUMBER
batch_size = 50

op = lambda: cgds.get_genetic_profile_data(
    case_list_id, genetic_profile_id,
    api.get_hugo_gene_ids(), gene_id_batch_size=batch_size
)
d = db.cache_raw_operation(op, src.TCGA_BREAST_v1, 'gene-copy-number')

2016-11-19 23:38:02,303:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 1 of 789
2016-11-19 23:40:55,132:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 79 of 789
2016-11-19 23:45:24,141:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 157 of 789
2016-11-19 23:48:25,862:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 235 of 789
2016-11-19 23:51:30,414:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 313 of 789
2016-11-19 23:56:20,710:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 391 of 789
2016-11-19 23:58:49,424:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 469 of 789
2016-11-20 00:01:43,506:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 547 of 789
2016-11-20 00:04:42,509:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 625 of 789
2016-11-20 00:07:54,391:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 703 of 789
2016-11-20 00:11:10,535:INFO:mgds.data_aggr
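
The log above reports 789 batches at a batch size of 50, which implies between 39,401 and 39,450 gene IDs in total. As a rough sketch of how such batching could work (not the actual `cgds` implementation), the hypothetical helper `iter_batches` below slices a list into fixed-size chunks. The gene ID list and its length are made up and only chosen to be consistent with the batch count in the log.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical stand-in for the list returned by api.get_hugo_gene_ids();
# the real list and its exact length are not shown in this notebook.
gene_ids = ['GENE{}'.format(i) for i in range(39450)]

batches = list(iter_batches(gene_ids, 50))

# The number of batches is the ceiling of (number of IDs / batch size)
n_batches = math.ceil(len(gene_ids) / 50)
```

Requesting genes in small batches keeps each individual request short, at the cost of many round trips (roughly half an hour in total according to the timestamps above).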

In [14]:
d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18063792 entries, 0 to 31850111
Data columns (total 4 columns):
GENE_ID:CGDS    18063792 non-null int64
GENE_ID:HGNC    18063792 non-null object
CELL_LINE_ID    18063792 non-null object
VALUE           18063792 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 689.1+ MB


In [7]:
d = cgds.melt_raw_data(d)
d.info()

[Remove null values for column "VALUE"] Records before = 31889144, Records after = 18063792, Records removed = 13825352 (%43.35)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18063792 entries, 0 to 31850111
Data columns (total 4 columns):
GENE_ID:CGDS    int64
GENE_ID:HGNC    object
CELL_LINE_ID    object
VALUE           float64
dtypes: float64(1), int64(1), object(2)
memory usage: 689.1+ MB
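
The melt step above can be approximated with plain pandas. This is a minimal sketch, assuming the raw result is a wide table with one row per gene and one column per sample; that layout, the second sample ID, and the toy values are assumptions, and `cgds.melt_raw_data` may do more than this.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format result: one row per gene, one column per sample
wide = pd.DataFrame({
    'GENE_ID:CGDS': [1, 2],
    'GENE_ID:HGNC': ['A1BG', 'A2M'],
    'TCGA-A1-A0SB-01': [0.005, -0.002],
    'TCGA-XX-XXXX-01': [np.nan, 0.1],  # made-up sample ID
})

# Reshape to long format: one row per (gene, sample) pair
id_cols = ['GENE_ID:CGDS', 'GENE_ID:HGNC']
long = pd.melt(wide, id_vars=id_cols, var_name='CELL_LINE_ID', value_name='VALUE')

# Drop records with no measurement, keeping the original row labels
n_before = len(long)
long = long[long['VALUE'].notnull()]
n_after = len(long)
n_removed = n_before - n_after
```

Filtering without resetting the index leaves gaps in the row labels, which would be consistent with the output above showing 18,063,792 entries with labels running up to 31,850,111.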


In [10]:
d_agg, d_dist = cgds.aggregate(d)
d_agg.head()

| | CELL_LINE_ID | GENE_ID:HGNC | GENE_ID:CGDS | VALUE_STD | VALUE_MEAN |
|---|---|---|---|---|---|
| 0 | TCGA-A1-A0SB-01 | A1BG | 1 | 0.0 | 0.005 |
| 1 | TCGA-A1-A0SB-01 | A1CF | 29974 | 0.0 | -0.001 |
| 2 | TCGA-A1-A0SB-01 | A2M | 2 | 0.0 | -0.002 |
| 3 | TCGA-A1-A0SB-01 | A2ML1 | 144568 | 0.0 | -0.002 |
| 4 | TCGA-A1-A0SB-01 | A2MP1 | 3 | 0.0 | -0.002 |


In [12]:
d_dist

1    18007488
2       24480
3        2448
Name: Number of Replicates, dtype: int64
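
The aggregation and the replicate distribution can be sketched with a pandas `groupby`. This is an illustration under assumptions, not the `cgds.aggregate` implementation: the grouping keys are inferred from the columns of `d_agg`, the population standard deviation (`ddof=0`) is assumed because single-replicate groups show a `VALUE_STD` of 0.0 rather than NaN, and the toy records are made up.

```python
import pandas as pd

# Toy long-format records; the second sample ID and the duplicated row are made up
d = pd.DataFrame({
    'CELL_LINE_ID': ['TCGA-A1-A0SB-01'] * 3 + ['TCGA-XX-XXXX-01'],
    'GENE_ID:HGNC': ['A1BG', 'A2M', 'A2M', 'A1BG'],
    'GENE_ID:CGDS': [1, 2, 2, 1],
    'VALUE': [0.005, -0.002, -0.004, 0.1],
})

# Assumed grouping keys: one group per (sample, gene) pair
keys = ['CELL_LINE_ID', 'GENE_ID:HGNC', 'GENE_ID:CGDS']
grouped = d.groupby(keys)['VALUE']

# Collapse replicates to their mean and spread; ddof=0 yields 0.0 for single values
d_agg = pd.DataFrame({
    'VALUE_STD': grouped.std(ddof=0),
    'VALUE_MEAN': grouped.mean(),
}).reset_index()

# Count how many groups have 1, 2, 3, ... replicate records
d_dist = grouped.size().value_counts().sort_index()
d_dist.name = 'Number of Replicates'

# Every input record belongs to exactly one group
n_records = sum(k * v for k, v in d_dist.items())
```

The same bookkeeping holds for the counts reported above: 18,007,488 + 24,480 + 2,448 = 18,034,416 groups, and 18,007,488 + 2 × 24,480 + 3 × 2,448 = 18,063,792 records, which matches the number of rows in `d`.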

In [11]:
d_agg.describe()

| | GENE_ID:CGDS | VALUE_STD | VALUE_MEAN |
|---|---|---|---|
| count | 18034420.0 | 18034416.0 | 18034420.0 |
| mean | 8703688.0 | 0.0 | 0.03469139 |
| std | 28103300.0 | 0.0 | 0.4305323 |
| min | 1.0 | 0.0 | -1.293 |
| 25% | 8427.0 | 0.0 | -0.068 |
| 50% | 55699.0 | 0.0 | 0.0 |
| 75% | 197259.0 | 0.0 | 0.062 |
| max | 105371600.0 | 0.0 | 3.657 |


In [16]:
# importlib.reload replaces the deprecated imp.reload (imp was removed in Python 3.12)
import importlib
importlib.reload(src)
src.TCGA_BREAST_v1

'tcga-breast_v1'

In [17]:
# Verify that the aggregated frame has no missing values, then save that same frame
assert np.all(pd.notnull(d_agg))
db.save(d_agg, src.TCGA_BREAST_v1, db.IMPORT, 'gene-copy-number')

'/Users/eczech/data/research/mgds/import/tcga-breast_v1_gene-copy-number.pkl'
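
The returned path suggests a `<source>_<dataset>.pkl` naming scheme inside an import directory. A minimal sketch of a validate-then-save step following that scheme is shown below; `save_frame` is a hypothetical helper, the pickle format is inferred only from the `.pkl` extension, and `db.save` may behave differently.

```python
import os
import tempfile

import pandas as pd

def save_frame(df, source, dataset, root):
    """Pickle `df` to <root>/<source>_<dataset>.pkl and return the path."""
    # Refuse to persist frames with missing values, mirroring the assertion above
    if not df.notnull().all().all():
        raise ValueError('refusing to save a frame containing null values')
    path = os.path.join(root, '{}_{}.pkl'.format(source, dataset))
    df.to_pickle(path)
    return path

# A temporary directory stands in for the real import directory
root = tempfile.mkdtemp()
frame = pd.DataFrame({
    'CELL_LINE_ID': ['TCGA-A1-A0SB-01'],
    'VALUE_MEAN': [0.005],
})

path = save_frame(frame, 'tcga-breast_v1', 'gene-copy-number', root)

# Read the file back to confirm the round trip
restored = pd.read_pickle(path)
```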