# CCLE Raw Gene Copy Number Data Importation
**Local Version**: 1
**Source Version**: NA

This notebook imports raw CCLE gene copy number data through the [CGDS](http://www.cbioportal.org/cgds_r.jsp) ("Cancer Genomic Data Server") portal. It should not be confused with the [GDSC](http://www.cancerrxgene.org/) portal, which is an entirely separate data source.

In [1]:
%run -m ipy_startup
%run -m ipy_logging
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import api
from mgds.data_aggregation.import_lib import ccle
from mgds.data_aggregation.import_lib import cgds
pd.set_option('display.max_info_rows', 25000000)

In [2]:
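case_list_id = ccle.CASE_LIST_ID
genetic_profile_id = ccle.PROF_COPY_NUMBER
batch_size = 50  # Number of gene ids requested per CGDS call

# Fetch copy number values for all HUGO gene ids in batches
op = lambda: cgds.get_genetic_profile_data(
    case_list_id, genetic_profile_id,
    api.get_hugo_gene_ids(), gene_id_batch_size=batch_size
)
# Run the fetch only if no serialized raw result already exists on disk
d = db.cache_raw_operation(op, src.CCLE_v1, 'gene-copy-number')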

2016-11-18 11:55:42,049:DEBUG:mgds.data_aggregation.io_utils: Restoring serialized object from "/Users/eczech/data/research/mgds/raw/ccle_v1_gene-copy-number.pkl"
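
The log line above indicates that the raw result was restored from a pickle file rather than fetched again. The internals of `db.cache_raw_operation` are not shown in this notebook, so the following is only a minimal sketch of that caching pattern, assuming a plain pickle file on disk. The names `cache_operation`, `expensive_fetch`, and `cache_path` are hypothetical and are not part of `mgds`.

```python
import os
import pickle
import tempfile

def cache_operation(operation, path):
    """Return the pickled result at `path` if it exists; otherwise run
    `operation`, pickle its result to `path`, and return it.
    (Hypothetical helper, not the mgds implementation.)"""
    if os.path.exists(path):
        with open(path, 'rb') as fd:
            return pickle.load(fd)
    result = operation()
    with open(path, 'wb') as fd:
        pickle.dump(result, fd)
    return result

# Stand-in for a slow remote fetch; records how often it actually runs
calls = []
def expensive_fetch():
    calls.append(1)
    return {'A1BG': -0.1544}

cache_path = os.path.join(tempfile.mkdtemp(), 'example.pkl')
first = cache_operation(expensive_fetch, cache_path)   # runs the fetch
second = cache_operation(expensive_fetch, cache_path)  # restored from disk
```

With this pattern, rerunning the cell costs only a file read, which matters when the underlying operation issues many batched web requests.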


In [3]:
d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39032 entries, 0 to 7
Columns: 1021 entries, GENE_ID to UOK101_KIDNEY
dtypes: float64(1019), int64(1), object(1)
memory usage: 304.3+ MB


In [4]:
d = cgds.melt_raw_data(d)
d.info()

[Remove null values for column "VALUE"] Records before = 39773608, Records after = 20313920, Records removed = 19459688 (%48.93)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 20313920 entries, 0 to 39773607
Data columns (total 4 columns):
GENE_ID:CGDS    20313920 non-null int64
GENE_ID:HGNC    20313920 non-null object
CELL_LINE_ID    20313920 non-null object
VALUE           20313920 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 774.9+ MB
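
The record count before null removal, 39,773,608, equals the 39,032 rows of the raw frame multiplied by its 1,019 float-valued cell line columns, which is consistent with a wide-to-long reshape. The body of `cgds.melt_raw_data` is not shown here, so the sketch below only illustrates that kind of reshape with `pd.melt` on toy data. The id column name `COMMON` and the cell line names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy wide frame: one row per gene, one column per cell line
wide = pd.DataFrame({
    'GENE_ID': [1, 2],
    'COMMON': ['A1BG', 'A2M'],   # assumed name for the gene symbol column
    'LINE_A': [-0.15, np.nan],
    'LINE_B': [0.20, -0.18],
})

# Reshape to one row per (gene, cell line) pair
long = pd.melt(
    wide, id_vars=['GENE_ID', 'COMMON'],
    var_name='CELL_LINE_ID', value_name='VALUE'
)
long = long.rename(columns={'GENE_ID': 'GENE_ID:CGDS', 'COMMON': 'GENE_ID:HGNC'})

# Drop pairs with no measurement, mirroring the null removal logged above
n_before = len(long)
long = long[long['VALUE'].notnull()]
n_removed = n_before - len(long)
```

Filtering rows without resetting the index leaves gaps in the index, which would match the `0 to 39773607` index range reported for 20,313,920 entries.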


In [6]:
d_agg, d_dist = cgds.aggregate(d)
d_agg.head()

|   | CELL_LINE_ID | GENE_ID:HGNC | GENE_ID:CGDS | VALUE_MEAN | VALUE_STD |
|---|---|---|---|---|---|
| 0 | 1321N1_CENTRAL_NERVOUS_SYSTEM | A1BG | 1 | -0.1544 | 0.0 |
| 1 | 1321N1_CENTRAL_NERVOUS_SYSTEM | A1BG-AS1 | 503538 | -0.1544 | 0.0 |
| 2 | 1321N1_CENTRAL_NERVOUS_SYSTEM | A1CF | 29974 | -0.0985 | 0.0 |
| 3 | 1321N1_CENTRAL_NERVOUS_SYSTEM | A2M | 2 | -0.1819 | 0.0 |
| 4 | 1321N1_CENTRAL_NERVOUS_SYSTEM | A2ML1 | 144568 | -0.1819 | 0.0 |


In [7]:
d_agg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20279095 entries, 0 to 20279094
Data columns (total 5 columns):
CELL_LINE_ID    20279095 non-null object
GENE_ID:HGNC    20279095 non-null object
GENE_ID:CGDS    20279095 non-null int64
VALUE_MEAN      20279095 non-null float64
VALUE_STD       20279095 non-null float64
dtypes: float64(2), int64(1), object(2)
memory usage: 773.6+ MB


In [8]:
d_dist

1    20248250
2       26865
3        3980
Name: Number of Replicates, dtype: int64
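
The replicate counts are consistent with the earlier outputs: 20,248,250 + 26,865 + 3,980 = 20,279,095 aggregated rows, and 20,248,250 + 2 × 26,865 + 3 × 3,980 = 20,313,920 melted records. The implementation of `cgds.aggregate` is not shown, so the following is a rough sketch of one way to produce a per-pair mean, standard deviation, and replicate distribution with a pandas `groupby`. The use of a population standard deviation (`ddof=0`), chosen so that single replicates yield 0.0 as in the table above, is an assumption, as is the helper column `N`.

```python
import pandas as pd

# Toy long-format data; (LINE_A, A1BG) is measured twice
long = pd.DataFrame({
    'CELL_LINE_ID': ['LINE_A', 'LINE_A', 'LINE_A', 'LINE_B'],
    'GENE_ID:HGNC': ['A1BG', 'A1BG', 'A2M', 'A1BG'],
    'GENE_ID:CGDS': [1, 1, 2, 1],
    'VALUE': [0.1, 0.3, -0.2, 0.5],
})

keys = ['CELL_LINE_ID', 'GENE_ID:HGNC', 'GENE_ID:CGDS']
grouped = long.groupby(keys)['VALUE']

# One row per (cell line, gene) pair with summary statistics
agg = pd.DataFrame({
    'VALUE_MEAN': grouped.mean(),
    'VALUE_STD': grouped.std(ddof=0),  # assumption: population std, 0.0 for one replicate
    'N': grouped.size(),               # hypothetical replicate count column
}).reset_index()

# Distribution of replicate counts across pairs
dist = agg['N'].value_counts().sort_index()
dist.name = 'Number of Replicates'
```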

In [9]:
# Confirm that the melted frame `d` contains no null values, then save it to the import location
assert np.all(pd.notnull(d))
db.save(d, src.CCLE_v1, db.IMPORT, 'gene-copy-number')

'/Users/eczech/data/research/mgds/import/ccle_v1_gene-copy-number.pkl'