# TCGA RPPA Data Importation
**Local Version**: 1
**Source Version**: NA

This notebook will import raw and normalized TCGA RPPA data through the [CGDS](http://www.cbioportal.org/cgds_r.jsp) portal.

RPPA data is available in two forms: raw values and Z-scores.  Both will be imported below.

Note: CGDS has separate endpoints called "getProteinArrayData" and "getProteinArrayInfo", but neither of them seems to work.  However, the RPPA data does come back through getProfileData calls, so that method will be used instead.

Example genetic profile data call for RPPA data (the genes listed are the only ones that actually have data, as determined by a larger query for all of them): http://www.cbioportal.org/public-portal/webservice.do?cmd=getProfileData&genetic_profile_id=brca_tcga_pub2015_rppa_Zscores&gene_list=DIABLO,DIRAS3,DPP4,DVL3,EEF2,EEF2K,EGFR,EIF4E,EIF4EBP1,EIF4G1,ENY2,EPPK1,ERBB2,ERBB3,ERCC1,ERCC5,ERRFI1,ESR1,ETS1,FASN,FN1,FOXM1,FOXO3,G6PD,GAB2,GAPDH,GATA3,GATA6,HSPA1A,IGFBP2,INPP4B,IRF1,IRS1,ITGA2,JAK2,JUN,KAT2A,KDR,KIT,LCK,MAP2K1,MAPK1,MAPK14,MAPK8,MAPK9,MET,MRE11A,MS4A1,MSH2,MSH6&id_type=gene_symbol&case_set_id=brca_tcga_all

Example call to get protein array data (which does not work):
http://www.cbioportal.org/public-portal/webservice.do?cmd=getProteinArrayData&array_info=1&case_set_id=brca_tcga_all
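
The getProfileData URL above is just a base address plus five query parameters, so it can be assembled programmatically. The following is a minimal sketch using only the standard library; `build_profile_data_url` is a hypothetical helper written for illustration (it is not part of `mgds`), and the parameter names are copied from the example URL rather than from any API reference.

```python
from urllib.parse import urlencode

# Base URL copied from the example calls above
CGDS_BASE_URL = "http://www.cbioportal.org/public-portal/webservice.do"


def build_profile_data_url(profile_id, case_set_id, genes, base_url=CGDS_BASE_URL):
    """Build a getProfileData URL (hypothetical helper, not part of mgds)."""
    params = [
        ("cmd", "getProfileData"),
        ("genetic_profile_id", profile_id),
        ("gene_list", ",".join(genes)),
        ("id_type", "gene_symbol"),
        ("case_set_id", case_set_id),
    ]
    # safe="," keeps the comma-separated gene list unescaped, as in the example URL
    return base_url + "?" + urlencode(params, safe=",")


url = build_profile_data_url(
    "brca_tcga_pub2015_rppa_Zscores", "brca_tcga_all", ["EGFR", "ERBB2", "ESR1"]
)
print(url)
```

Building the query from a list of pairs keeps the parameter order identical to the hand-written URL, which makes the two easy to compare.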

In [1]:
%run -m ipy_startup
%run -m ipy_logging false
%matplotlib inline
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import data_type as dtyp
from mgds.data_aggregation import api
from mgds.data_aggregation.import_lib import cgds
from mgds.data_aggregation.import_lib import tcga
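from py_utils.collection_utils import subset
# Explicit imports for the np/pd names used later (ipy_startup may already provide them)
import numpy as np
import pandas as pd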

In [3]:
# Import the normalized (Z-score) RPPA profiles for all HUGO gene ids
norm_tables = tcga.import_genetic_profile_data(
    profile_fmt=tcga.PROF_FMT_RPPA_ZSCORE,
    data_type=dtyp.add_normalized_modifier(dtyp.GENE_RPPA),
    gene_ids=api.get_hugo_gene_ids()
)

2016-12-19 08:33:47,105:INFO:mgds.data_aggregation.import_lib.tcga: Importing data for study "acc_tcga" (1 of 32), cohort "acc", case list "acc_tcga_all", profile "acc_tcga_rppa_Zscores", table "acc-gene-rppa-normalized"
2016-12-19 08:33:47,106:DEBUG:mgds.data_aggregation.io_utils: Restoring serialized object from "/Users/eczech/data/research/mgds/raw/tcga_v1_acc-gene-rppa-normalized.pkl"
2016-12-19 08:33:47,170:INFO:mgds.data_aggregation.import_lib.tcga: Importing data for study "blca_tcga" (2 of 32), cohort "blca", case list "blca_tcga_all", profile "blca_tcga_rppa_Zscores", table "blca-gene-rppa-normalized"
2016-12-19 08:33:47,171:DEBUG:mgds.data_aggregation.io_utils: Restoring serialized object from "/Users/eczech/data/research/mgds/raw/tcga_v1_blca-gene-rppa-normalized.pkl"
2016-12-19 08:33:47,472:INFO:mgds.data_aggregation.import_lib.tcga: Importing data for study "brca_tcga" (3 of 32), cohort "brca", case list "brca_tcga_all", profile "brca_tcga_rppa_Zscores", table "brca-gene-r
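
The DEBUG lines above report that each table is restored from a serialized `.pkl` file instead of being downloaded again. The sketch below illustrates that kind of pickle-backed caching in general terms. It makes no claim about how `mgds` implements it: `cached_fetch` and `fake_fetch` are hypothetical names, and the payload is made-up data.

```python
import os
import pickle
import tempfile


def cached_fetch(path, fetch_fn):
    """Return the object pickled at `path`, calling `fetch_fn` only on a cache miss."""
    if os.path.exists(path):
        with open(path, "rb") as fd:
            return pickle.load(fd)
    obj = fetch_fn()
    with open(path, "wb") as fd:
        pickle.dump(obj, fd)
    return obj


calls = []


def fake_fetch():
    # Stand-in for a slow web-service request (hypothetical, made-up payload)
    calls.append(1)
    return {"profile": "acc_tcga_rppa_Zscores", "rows": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "example.pkl")
    first = cached_fetch(cache_path, fake_fetch)   # miss: fetches and writes the pickle
    second = cached_fetch(cache_path, fake_fetch)  # hit: restored from disk

print(first == second, len(calls))
```

With a cache like this, re-running an import cell only costs a disk read per table, which would be consistent with the sub-second timestamps in the log.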

In [None]:
# Import the raw (non-normalized) RPPA profiles for all HUGO gene ids
raw_tables = tcga.import_genetic_profile_data(
    profile_fmt=tcga.PROF_FMT_RPPA,
    data_type=dtyp.GENE_RPPA,
    gene_ids=api.get_hugo_gene_ids()
)

In [None]:
# Load the previously imported normalized gene expression tables for the BRCA cohort only
d = tcga.load_genetic_profile_data(dtyp.add_normalized_modifier(dtyp.GENE_EXPRESSION), cohorts=['brca'])

In [4]:
# Melt to long format (one row per gene/sample value) and inspect the result
d = cgds.melt_raw_data(d)
d.info()

[Remove null values for column "VALUE"] Records before = 31889144, Records after = 8808287, Records removed = 23080857 (%72.38)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8808287 entries, 0 to 31889143
Data columns (total 4 columns):
GENE_ID:CGDS    int64
GENE_ID:HGNC    object
CELL_LINE_ID    object
VALUE           float64
dtypes: float64(1), int64(1), object(2)
memory usage: 336.0+ MB
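
The output above suggests that melting turns a wide gene-by-sample table into long form and drops missing values. The sketch below shows that pattern with plain pandas on made-up data. It assumes, without confirmation from the `mgds` source, that `cgds.melt_raw_data` does something similar. The column names are taken from the `d.info()` output; the sample names and values are invented.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per gene, one column per sample (made-up values)
wide = pd.DataFrame({
    "GENE_ID:CGDS": [1, 2, 144568],
    "GENE_ID:HGNC": ["A1BG", "A2M", "A2ML1"],
    "SAMPLE-1": [0.07, np.nan, 0.48],
    "SAMPLE-2": [np.nan, 0.89, np.nan],
})

id_cols = ["GENE_ID:CGDS", "GENE_ID:HGNC"]

# Wide -> long: each (gene, sample) pair becomes one row
long = wide.melt(id_vars=id_cols, var_name="CELL_LINE_ID", value_name="VALUE")

# Drop missing measurements and report how many rows were removed
n_before = len(long)
long = long[long["VALUE"].notnull()]
n_removed = n_before - len(long)
print("Records before = %d, after = %d, removed = %d (%.2f%%)"
      % (n_before, len(long), n_removed, 100.0 * n_removed / n_before))
```

Filtering keeps the original row labels, which matches the `d.info()` output above, where 8808287 entries carry index labels running from 0 to 31889143.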


In [5]:
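# Aggregate replicate measurements per sample and gene;
# d_dist holds the distribution of replicate counts
d_agg, d_dist = cgds.aggregate(d)
d_agg.head()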

| | CELL_LINE_ID | GENE_ID:HGNC | GENE_ID:CGDS | VALUE_MEAN | VALUE_STD | VALUE_CT |
|---|---|---|---|---|---|---|
| 0 | TCGA-A1-A0SB-01 | A1BG | 1 | 0.073395 | 0.0 | 1 |
| 1 | TCGA-A1-A0SB-01 | A1BG-AS1 | 503538 | 0.724501 | 0.0 | 1 |
| 2 | TCGA-A1-A0SB-01 | A2M | 2 | 0.891226 | 0.0 | 1 |
| 3 | TCGA-A1-A0SB-01 | A2ML1 | 144568 | 0.4784 | 0.0 | 1 |
| 4 | TCGA-A1-A0SB-01 | A4GALT | 53947 | 0.513391 | 0.0 | 1 |


In [6]:
d_dist

1    8785063
2      11612
Name: Number of Replicates, dtype: int64

In [7]:
d_agg.describe()

| | GENE_ID:CGDS | VALUE_MEAN | VALUE_STD | VALUE_CT |
|---|---|---|---|---|
| count | 8796675.0 | 8796675.0 | 8796675.0 | 8796675.0 |
| mean | 593048.0 | 0.407858 | 0.0 | 1.00132 |
| std | 7169097.0 | 0.3489149 | 0.0 | 0.03630843 |
| min | 1.0 | 0.003360104 | 0.0 | 1.0 |
| 25% | 6774.0 | 0.04721479 | 0.0 | 1.0 |
| 50% | 27347.0 | 0.3495343 | 0.0 | 1.0 |
| 75% | 84662.0 | 0.7720465 | 0.0 | 1.0 |
| max | 100529100.0 | 0.9962133 | 0.0 | 2.0 |


In [8]:
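# Verify that the aggregated table contains no null values before saving
assert np.all(pd.notnull(d_agg))
# Persist the aggregated table; the call returns the path of the written file
db.save(d_agg, src.TCGA_BREAST_v1, db.IMPORT, 'gene-methylation')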

'/Users/eczech/data/research/mgds/import/tcga-breast_v1_gene-methylation.pkl'