# NCI60 Copy Number Data Importation
**Local Version**: 2
**Source Version**: NA

This notebook imports raw NCI60 copy number data through the [CGDS](http://www.cbioportal.org/cgds_r.jsp) ("Cancer Genomic Data Server") portal.

In [1]:
# ipy_startup is assumed to define the pd and np aliases used below
%run -m ipy_startup
%run -m ipy_logging
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import api
from mgds.data_aggregation.import_lib import cgds
from mgds.data_aggregation.import_lib import nci60
# Raise the row limit so that d.info() still reports non-null counts on large frames
pd.set_option('display.max_info_rows', 25000000)

In [11]:
case_list_id = nci60.CASE_LIST_ID
genetic_profile_id = nci60.PROF_COPY_NUMBER
batch_size = 50

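# Deferred download of copy number values for all HUGO gene IDs,
# requested from CGDS in batches of `batch_size` genes
op = lambda: cgds.get_genetic_profile_data(
    case_list_id, genetic_profile_id,
    api.get_hugo_gene_ids(), gene_id_batch_size=batch_size
)
# With overwrite=False, a previously serialized result is restored instead of re-running op
d = db.cache_raw_operation(op, src.NCI60_v2, 'gene-copy-number', overwrite=False)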

2016-11-18 11:32:27,001:DEBUG:mgds.data_aggregation.io_utils: Restoring serialized object from "/Users/eczech/data/research/mgds/raw/nci60_v2_gene-copy-number.pkl"
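
The log line above shows a serialized object being restored rather than downloaded again. The internals of `db.cache_raw_operation` are not shown in this notebook, so the following is only a minimal sketch of that caching pattern, assuming a pickle file on disk; `cache_operation`, `expensive_op`, and the temporary path are hypothetical names used for illustration.

```python
import os
import pickle
import tempfile


def cache_operation(op, path, overwrite=False):
    """Return the pickled result at `path` if present, else run `op` and store it.

    Hypothetical helper; not the mgds implementation.
    """
    if os.path.exists(path) and not overwrite:
        # Restore the previously serialized result instead of recomputing it
        with open(path, 'rb') as fd:
            return pickle.load(fd)
    result = op()
    with open(path, 'wb') as fd:
        pickle.dump(result, fd)
    return result


# Stand-in for a slow download; records how many times it actually runs
calls = []

def expensive_op():
    calls.append(1)
    return {'A1BG': 0.1366}

path = os.path.join(tempfile.mkdtemp(), 'example.pkl')
first = cache_operation(expensive_op, path)   # runs the operation and writes the file
second = cache_operation(expensive_op, path)  # restores the result from disk
```

Passing the operation as a callable (here a plain function, in the notebook a `lambda`) is what lets the download be skipped entirely when a cached copy exists.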


In [12]:
d.head()

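| | GENE_ID | COMMON | BT_549 | HS578T | MCF7 | MDA_MB_231 | T47D | SF_268 | SF_295 | SF_539 | ... | DU_145 | PC_3 | 786_0 | A498 | ACHN | CAKI_1 | RXF_393 | SN12C | TK_10 | UO_31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A1BG | 0.1366 | 0.1495 | 0.0657 | 0.1245 | -0.2171 | NaN | NaN | NaN | ... | -0.1764 | 0.3076 | NaN | 0.1966 | -0.0623 | -0.1392 | 0.2353 | -0.0574 | -0.0112 | -0.0056 |
| 1 | 503538 | A1BG-AS1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 29974 | A1CF | -0.018 | -0.1882 | 0.0137 | 0.072 | 0.3061 | NaN | NaN | NaN | ... | 0.1029 | 0.3781 | NaN | 0.1642 | 0.2257 | 0.0535 | -0.0328 | 0.0197 | 0.0487 | -0.0377 |
| 3 | 2 | A2M | -0.0935 | -0.1803 | 0.0024 | -0.1653 | -0.6096 | NaN | NaN | NaN | ... | -0.2178 | -0.2593 | NaN | -0.1126 | 0.314 | 0.0463 | 0.0202 | -0.2604 | 0.0738 | 0.1439 |
| 4 | 144571 | A2M-AS1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |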


In [16]:
d = cgds.melt_raw_data(d)
d.info()

[Remove null values for column "VALUE"] Records before = 2341920, Records after = 1240518, Records removed = 1101402 (%47.03)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1240518 entries, 0 to 2341919
Data columns (total 4 columns):
GENE_ID:CGDS    1240518 non-null int64
GENE_ID:HGNC    1240518 non-null object
CELL_LINE_ID    1240518 non-null object
VALUE           1240518 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 47.3+ MB
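
The step above turns the wide gene-by-cell-line table into one row per (gene, cell line) measurement and drops missing values. How `cgds.melt_raw_data` does this is not shown here; the sketch below reproduces the same shape of result with plain pandas, using a few values from the `d.head()` output. The mapping of `GENE_ID` to `GENE_ID:CGDS` and `COMMON` to `GENE_ID:HGNC` is an assumption inferred from the column dtypes in the `info()` output.

```python
import numpy as np
import pandas as pd

# Small wide-format frame using values from the head() output above
wide = pd.DataFrame({
    'GENE_ID': [1, 503538, 29974],
    'COMMON': ['A1BG', 'A1BG-AS1', 'A1CF'],
    'BT_549': [0.1366, np.nan, -0.018],
    'SF_268': [np.nan, np.nan, np.nan],
})

# Wide to long: one row per (gene, cell line) pair
long = pd.melt(
    wide, id_vars=['GENE_ID', 'COMMON'],
    var_name='CELL_LINE_ID', value_name='VALUE'
)
# Assumed column naming, chosen to match the info() output above
long = long.rename(columns={'GENE_ID': 'GENE_ID:CGDS', 'COMMON': 'GENE_ID:HGNC'})

# Drop measurements that are missing, keeping track of how many were removed
n_before = len(long)
long = long[long['VALUE'].notnull()]
n_after = len(long)
n_removed = n_before - n_after
```

Filtering by boolean mask keeps the original row labels, which is consistent with the `Int64Index` in the output above still running from 0 to 2341919 after the null records were removed.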


In [38]:
d_agg, d_dist = cgds.aggregate(d)
d_dist

1    1236543
2       1749
3        159
Name: Number of Replicates, dtype: int64
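
The replicate distribution above counts how many (cell line, gene) groups have one, two, or three measurements. A rough sketch of this kind of aggregation with pandas follows; it is not the `cgds.aggregate` implementation, and the toy values and gene names are made up for illustration. The population standard deviation (`ddof=0`) is an assumption: it yields 0 rather than NaN for groups with a single measurement, which would be consistent with `VALUE_STD` being non-null everywhere in the aggregated output.

```python
import pandas as pd

# Toy long-format data: MCF7 / GENE_A has two replicate measurements
long = pd.DataFrame({
    'CELL_LINE_ID': ['MCF7', 'MCF7', 'MCF7', 'T47D'],
    'GENE_ID:HGNC': ['GENE_A', 'GENE_A', 'GENE_B', 'GENE_A'],
    'GENE_ID:CGDS': [1, 1, 2, 1],
    'VALUE': [0.1, 0.3, -0.2, 0.5],
})

keys = ['CELL_LINE_ID', 'GENE_ID:HGNC', 'GENE_ID:CGDS']
grouped = long.groupby(keys)['VALUE']

# One row per group with the mean and spread of its replicates
agg = pd.DataFrame({
    'VALUE_STD': grouped.std(ddof=0),   # assumed ddof=0, so singletons give 0.0
    'VALUE_MEAN': grouped.mean(),
}).reset_index()

# Number of groups having each replicate count
dist = grouped.size().value_counts().sort_index()
dist.name = 'Number of Replicates'
```

In the real output the distribution is self-consistent with the melted data: 1236543 + 1749 + 159 = 1238451 groups, and 1236543 + 2 × 1749 + 3 × 159 = 1240518 individual records.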

In [39]:
d_agg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1238451 entries, 0 to 1238450
Data columns (total 5 columns):
CELL_LINE_ID    1238451 non-null object
GENE_ID:HGNC    1238451 non-null object
GENE_ID:CGDS    1238451 non-null int64
VALUE_STD       1238451 non-null float64
VALUE_MEAN      1238451 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 47.2+ MB


In [37]:
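# Sanity check: the aggregated frame must not contain any null values
assert np.all(pd.notnull(d_agg))
# Persist the aggregated data to the import location
db.save(d_agg, src.NCI60_v2, db.IMPORT, 'gene-copy-number')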

'/Users/eczech/data/research/mgds/import/nci60_v2_gene-copy-number.pkl'