# TCGA Breast Clinical Data Importation
**Local Version**: 1
**Source Version**: NA

This notebook will import raw TCGA clinical data through the [CGDS](http://www.cbioportal.org/cgds_r.jsp) portal for the study named "Breast Invasive Carcinoma (TCGA, Cell 2015)".

This study is preferred over "Breast Invasive Carcinoma (TCGA, Nature 2012)", despite having slightly fewer samples, because it appears to be newer and includes more data types.

In [4]:
%run -m ipy_startup
%run -m ipy_logging false
%matplotlib inline
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import data_type as dtyp
from mgds.data_aggregation.import_lib import cgds
from mgds.data_aggregation.import_lib import tcga

In [32]:
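# `imp` is deprecated (and removed in Python 3.12); importlib.reload is the replacement
import importlib
importlib.reload(tcga)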

<module 'mgds.data_aggregation.import_lib.tcga' from '/Users/eczech/repos/mgds/python/src/mgds/data_aggregation/import_lib/tcga.py'>

In [33]:
tables = tcga.import_clinical_data(case_list_id_fmt=tcga.CASE_LIST_ID_FMT, cohorts=None)

2016-12-20 10:09:08,808:INFO:mgds.data_aggregation.import_lib.tcga: Importing data for study "acc_tcga" (1 of 32), cohort "acc", case list "acc_tcga_all", table "acc-cellline-meta"
2016-12-20 10:09:10,532:DEBUG:mgds.data_aggregation.io_utils: Writing serialized object to "/Users/eczech/data/research/mgds/raw/tcga_v1_acc-cellline-meta.pkl"
2016-12-20 10:09:10,534:DEBUG:mgds.data_aggregation.io_utils: Restoring serialized object from "/Users/eczech/data/research/mgds/raw/tcga_v1_acc-cellline-meta.pkl"

In [37]:
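# `imp` is deprecated (and removed in Python 3.12); importlib.reload is the replacement
import importlib
importlib.reload(cgds)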

<module 'mgds.data_aggregation.import_lib.cgds' from '/Users/eczech/repos/mgds/python/src/mgds/data_aggregation/import_lib/cgds.py'>

In [44]:
d = tcga.load_clinical_data(cohorts=None)
d = cgds.prep_clinical_data(d, keep_cols=['COHORT'])



In [45]:
d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11312 entries, 0 to 79
Data columns (total 23 columns):
AGE                                     11089 non-null float64
CANCER_TYPE                             11112 non-null object
CANCER_TYPE_DETAILED                    11112 non-null object
CELL_LINE_ID                            11312 non-null object
COHORT                                  11312 non-null object
DAYS_TO_BIRTH                           11023 non-null float64
DAYS_TO_COLLECTION                      6632 non-null float64
DAYS_TO_DEATH                           2726 non-null float64
ETHNICITY                               8531 non-null object
GENDER                                  11139 non-null object
HISTOLOGICAL_DIAGNOSIS                  10444 non-null object
HISTOLOGICAL_SUBTYPE                    592 non-null object
INITIAL_PATHOLOGIC_DX_YEAR              10594 non-null float64
METHOD_OF_INITIAL_SAMPLE_PROCUREMENT    4189 non-null object
OTHER_PATIENT_ID           

In [53]:
d.groupby('COHORT')['CANCER_TYPE'].unique()

COHORT
acc                                [Adrenocortical Carcinoma]
blca                                         [Bladder Cancer]
brca                                          [Breast Cancer]
cesc                                        [Cervical Cancer]
chol                                   [Hepatobiliary Cancer]
coadread                                  [Colorectal Cancer]
dlbc                                   [Non-Hodgkin Lymphoma]
esca                                 [Esophagogastric Cancer]
gbm                                                  [Glioma]
hnsc                                   [Head and Neck Cancer]
kich                                   [Renal Cell Carcinoma]
kirc                                   [Renal Cell Carcinoma]
kirp                                   [Renal Cell Carcinoma]
laml                                                    [nan]
lgg                                                  [Glioma]
lihc                                   [Hepatobiliary Cancer]
l

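# Site mappings

Each entry below has the form `cohort - CGDS cancer type - CCLE primary site`. The primary site is left blank where no mapping has been assigned yet.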

- acc - Adrenocortical Carcinoma -
- blca - Bladder Cancer - 
- brca - Breast Cancer - BREAST
- cesc - Cervical Cancer - 
- chol - Hepatobiliary Cancer -
- coadread - Colorectal Cancer - 
- dlbc - Non-Hodgkin Lymphoma - 
- esca - Esophagogastric Cancer - 
- gbm - Glioma -
- hnsc - Head and Neck Cancer - UPPER_AERODIGESTIVE_TRACT

TODO: Finish this
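
Once the list above is complete, it could be applied to the clinical data as a plain dictionary lookup. The following is only a minimal sketch: `COHORT_TO_PRIMARY_SITE` and the toy frame are hypothetical names introduced for illustration, and the dictionary holds only the two mappings already filled in above. Cohorts without a mapping come out as `NaN`, which makes the remaining TODO items easy to list.

```python
import pandas as pd

# Hypothetical lookup table; only the two mappings assigned so far are included.
COHORT_TO_PRIMARY_SITE = {
    'brca': 'BREAST',
    'hnsc': 'UPPER_AERODIGESTIVE_TRACT',
}

# Toy stand-in for the clinical data frame `d` (only the COHORT column matters here).
toy = pd.DataFrame({'COHORT': ['acc', 'brca', 'hnsc', 'gbm']})

# Series.map with a dict leaves NaN for any cohort that has no entry.
toy['PRIMARY_SITE'] = toy['COHORT'].map(COHORT_TO_PRIMARY_SITE)

# Cohorts that still need a primary site assigned.
unmapped = sorted(toy.loc[toy['PRIMARY_SITE'].isnull(), 'COHORT'].unique())
print(unmapped)
```

The same `map` call should work on the real frame as `d['COHORT'].map(...)`, assuming the `COHORT` values match the lowercase cohort codes shown above.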

In [52]:
dc = db.load(src.CCLE_v1, db.IMPORT, dtyp.CELLLINE_META)
dc['PRIMARY_SITE'].value_counts()

LUNG                                  184
HAEMATOPOIETIC_AND_LYMPHOID_TISSUE    180
SKIN                                   62
LARGE_INTESTINE                        60
BREAST                                 59
CENTRAL_NERVOUS_SYSTEM                 55
OVARY                                  51
PANCREAS                               46
STOMACH                                38
KIDNEY                                 33
UPPER_AERODIGESTIVE_TRACT              33
BONE                                   29
ENDOMETRIUM                            28
URINARY_TRACT                          28
LIVER                                  28
OESOPHAGUS                             26
SOFT_TISSUE                            20
AUTONOMIC_GANGLIA                      17
THYROID                                12
PLEURA                                 11
PROSTATE                                8
BILIARY_TRACT                           8
SALIVARY_GLAND                          2
SMALL_INTESTINE                   

In [42]:
d[['COHORT', 'CANCER_TYPE']].drop_duplicates()

| | COHORT | CANCER_TYPE |
|---|---|---|
| 0 | acc | Adrenocortical Carcinoma |
| 0 | blca | Bladder Cancer |
| 0 | brca | Breast Cancer |
| 0 | cellline | Breast Cancer |
| 0 | cesc | Cervical Cancer |
| 0 | chol | Hepatobiliary Cancer |
| 0 | coadread | Colorectal Cancer |
| 0 | dlbc | Non-Hodgkin Lymphoma |
| 0 | esca | Esophagogastric Cancer |
| 0 | gbm | Glioma |
