# TCGA Breast Clinical Data Importation
**Local Version**: 1
**Source Version**: NA

This notebook will import raw TCGA clinical data through the [CGDS](http://www.cbioportal.org/cgds_r.jsp) portal for the study named "Breast Invasive Carcinoma (TCGA, Cell 2015)".

This study is preferred over "Breast Invasive Carcinoma (TCGA, Nature 2012)" even though it has slightly fewer samples, because it appears to be newer and includes more data types.

In [1]:
%run -m ipy_startup
%run -m ipy_logging false
%matplotlib inline
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import data_type as dtyp
from mgds.data_aggregation.import_lib import cgds
from mgds.data_aggregation.import_lib import tcga

In [32]:
import importlib  # `imp` is deprecated and was removed in Python 3.12
importlib.reload(tcga)

<module 'mgds.data_aggregation.import_lib.tcga' from '/Users/eczech/repos/mgds/python/src/mgds/data_aggregation/import_lib/tcga.py'>

In [2]:
tables = tcga.import_clinical_data(case_list_id_fmt=tcga.CASE_LIST_ID_FMT, cohorts=None)

2017-04-28 08:51:25,969:INFO:mgds.data_aggregation.import_lib.tcga: Importing data for study "acc_tcga" (1 of 32), cohort "acc", case list "acc_tcga_all", table "acc-cellline-meta"
2017-04-28 08:51:25,972:DEBUG:py_utils.io_utils: Restoring serialized object from location "/Users/eczech/data/research/mgds/raw/tcga_v1_acc-cellline-meta.pkl"
2017-04-28 08:51:25,984:INFO:mgds.data_aggregation.import_lib.tcga: Importing data for study "blca_tcga" (2 of 32), cohort "blca", case list "blca_tcga_all", table "blca-cellline-meta"
2017-04-28 08:51:25,985:DEBUG:py_utils.io_utils: Restoring serialized object from location "/Users/eczech/data/research/mgds/raw/tcga_v1_blca-cellline-meta.pkl"
2017-04-28 08:51:26,028:INFO:mgds.data_aggregation.import_lib.tcga: Importing data for study "brca_tcga" (3 of 32), cohort "brca", case list "brca_tcga_all", table "brca-cellline-meta"
2017-04-28 08:51:26,029:DEBUG:py_utils.io_utils: Restoring serialized object from location "/Users/eczech/data/research/mgds/raw
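
The log lines above suggest a naming convention: for cohort `acc`, the study is `acc_tcga`, the case list is `acc_tcga_all`, and the table is `acc-cellline-meta`. A minimal sketch of that convention follows. The helper `derive_names` and all three format strings are assumptions inferred from the log output; they are not taken from `mgds`, and the real value of `tcga.CASE_LIST_ID_FMT` may differ.

```python
# Hypothetical helper: reproduces the identifiers seen in the import log.
# The format strings are inferred from the log, not read from mgds.
STUDY_ID_FMT = '{}_tcga'          # assumed
CASE_LIST_ID_FMT = '{}_tcga_all'  # assumed; tcga.CASE_LIST_ID_FMT may differ
TABLE_FMT = '{}-cellline-meta'    # assumed


def derive_names(cohort):
    """Return the study id, case list id, and table name for one cohort."""
    return {
        'study': STUDY_ID_FMT.format(cohort),
        'case_list': CASE_LIST_ID_FMT.format(cohort),
        'table': TABLE_FMT.format(cohort),
    }


# Print the derived identifiers for the first three cohorts in the log.
for cohort in ['acc', 'blca', 'brca']:
    print(derive_names(cohort))
```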

In [37]:
import importlib  # `imp` is deprecated and was removed in Python 3.12
importlib.reload(cgds)

<module 'mgds.data_aggregation.import_lib.cgds' from '/Users/eczech/repos/mgds/python/src/mgds/data_aggregation/import_lib/cgds.py'>

In [3]:
d = tcga.load_clinical_data(cohorts=None)
d = cgds.prep_clinical_data(d, keep_cols=['COHORT'])

In [4]:
d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11312 entries, 0 to 79
Data columns (total 23 columns):
AGE                                     11089 non-null float64
CANCER_TYPE                             11112 non-null object
CANCER_TYPE_DETAILED                    11112 non-null object
CELL_LINE_ID                            11312 non-null object
COHORT                                  11312 non-null object
DAYS_TO_BIRTH                           11023 non-null float64
DAYS_TO_COLLECTION                      6632 non-null float64
DAYS_TO_DEATH                           2726 non-null float64
ETHNICITY                               8531 non-null object
GENDER                                  11139 non-null object
HISTOLOGICAL_DIAGNOSIS                  10444 non-null object
HISTOLOGICAL_SUBTYPE                    592 non-null object
INITIAL_PATHOLOGIC_DX_YEAR              10594 non-null float64
METHOD_OF_INITIAL_SAMPLE_PROCUREMENT    4189 non-null object
OTHER_PATIENT_ID           

In [5]:
d.head()

Unnamed: 0,AGE,CANCER_TYPE,CANCER_TYPE_DETAILED,CELL_LINE_ID,COHORT,DAYS_TO_BIRTH,DAYS_TO_COLLECTION,DAYS_TO_DEATH,ETHNICITY,GENDER,...,METHOD_OF_INITIAL_SAMPLE_PROCUREMENT,OTHER_PATIENT_ID,OTHER_SAMPLE_ID,PRIMARY_SITE,RACE,SAMPLE_TYPE,TISSUE_SOURCE_SITE,TUMOR_TISSUE_SITE,TUMOR_TYPE,VITAL_STATUS
0,27.0,Adrenocortical Carcinoma,Adrenocortical Carcinoma,TCGA-PK-A5H9-01,acc,-10173.0,92.0,,NOT HISPANIC OR LATINO,FEMALE,...,,83E7B9F8-04A4-440F-AAFC-7A8E9DDFF284,0A0A8E16-D3AF-4E98-802E-761966588E4F,,ASIAN,Primary Tumor,PK,Adrenal,,Alive
1,23.0,Adrenocortical Carcinoma,Adrenocortical Carcinoma,TCGA-OR-A5J3-01,acc,-8624.0,1584.0,,HISPANIC OR LATINO,FEMALE,...,,DFD687BC-6E69-42F7-AF94-D17FC150D1A1,1FB59B6F-53C0-4B14-82CC-77CD55C67AD6,,WHITE,Primary Tumor,OR,Adrenal,,Alive
2,65.0,Adrenocortical Carcinoma,Adrenocortical Carcinoma,TCGA-OR-A5JJ-01,acc,-24082.0,1308.0,,,MALE,...,,9EC86C06-E7A9-4EA6-B36D-6F832909054B,0BE64816-8299-49B2-B96C-CD0F4DA7D37B,,WHITE,Primary Tumor,OR,Adrenal,,Dead
3,23.0,Adrenocortical Carcinoma,Adrenocortical Carcinoma,TCGA-OR-A5LM-01,acc,-8735.0,1821.0,,,MALE,...,,E612EF6E-F3F2-44BE-AB15-8E4095687CD6,6F41AF90-80F5-4032-A323-59204CB30567,,,Primary Tumor,OR,Adrenal,,Alive
4,37.0,Adrenocortical Carcinoma,Adrenocortical Carcinoma,TCGA-OR-A5KU-01,acc,-13559.0,4543.0,,NOT HISPANIC OR LATINO,FEMALE,...,,424A497A-48B9-4507-B234-C4FD08C8ACAD,EFBCD3A5-723A-4361-9769-C01336F5F22D,,WHITE,Primary Tumor,OR,Adrenal,,Alive


In [7]:
d[d['CANCER_TYPE'] == 'Breast Cancer'].iloc[0]

AGE                                                                           62
CANCER_TYPE                                                        Breast Cancer
CANCER_TYPE_DETAILED                    Breast Invasive Mixed Mucinous Carcinoma
CELL_LINE_ID                                                     TCGA-A7-A3J0-01
COHORT                                                                      brca
DAYS_TO_BIRTH                                                             -22672
DAYS_TO_COLLECTION                                                            78
DAYS_TO_DEATH                                                                NaN
ETHNICITY                                                 NOT HISPANIC OR LATINO
GENDER                                                                    FEMALE
HISTOLOGICAL_DIAGNOSIS                                        Mucinous Carcinoma
HISTOLOGICAL_SUBTYPE                                                         NaN
INITIAL_PATHOLOGIC_DX_YEAR  

In [53]:
d.groupby('COHORT')['CANCER_TYPE'].unique()

COHORT
acc                                [Adrenocortical Carcinoma]
blca                                         [Bladder Cancer]
brca                                          [Breast Cancer]
cesc                                        [Cervical Cancer]
chol                                   [Hepatobiliary Cancer]
coadread                                  [Colorectal Cancer]
dlbc                                   [Non-Hodgkin Lymphoma]
esca                                 [Esophagogastric Cancer]
gbm                                                  [Glioma]
hnsc                                   [Head and Neck Cancer]
kich                                   [Renal Cell Carcinoma]
kirc                                   [Renal Cell Carcinoma]
kirp                                   [Renal Cell Carcinoma]
laml                                                    [nan]
lgg                                                  [Glioma]
lihc                                   [Hepatobiliary Cancer]
l

# Site mappings

Each entry reads: TCGA cohort - cancer type - CCLE `PRIMARY_SITE` (left blank where no site has been assigned yet).

- acc - Adrenocortical Carcinoma -
- blca - Bladder Cancer -
- brca - Breast Cancer - BREAST
- cesc - Cervical Cancer -
- chol - Hepatobiliary Cancer -
- coadread - Colorectal Cancer -
- dlbc - Non-Hodgkin Lymphoma -
- esca - Esophagogastric Cancer -
- gbm - Glioma -
- hnsc - Head and Neck Cancer - UPPER_AERODIGESTIVE_TRACT

TODO: Finish this mapping for the remaining cohorts.
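
As a rough sketch of how this partial mapping might be held in code, the dictionary below contains only the two sites filled in so far (`brca` and `hnsc`); every other cohort is deliberately left unmapped rather than guessed. The names `SITE_BY_COHORT` and `map_sites` are hypothetical and are not part of `mgds`.

```python
# Partial cohort -> site mapping. Only the two entries already filled in
# are included; the remaining cohorts are intentionally left out.
SITE_BY_COHORT = {
    'brca': 'BREAST',
    'hnsc': 'UPPER_AERODIGESTIVE_TRACT',
}


def map_sites(cohorts, site_by_cohort=SITE_BY_COHORT):
    """Split cohorts into those with a known site and those still unmapped."""
    mapped = {c: site_by_cohort[c] for c in cohorts if c in site_by_cohort}
    unmapped = sorted(c for c in set(cohorts) if c not in site_by_cohort)
    return mapped, unmapped


mapped, unmapped = map_sites(['acc', 'blca', 'brca', 'hnsc'])
print(mapped)    # cohorts that already have a site
print(unmapped)  # cohorts that still need one
```

With the pandas frame `d` loaded earlier, the same dictionary could be applied as `d['COHORT'].map(SITE_BY_COHORT)`, which yields NaN for any cohort that has no entry yet.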

In [52]:
dc = db.load(src.CCLE_v1, db.IMPORT, dtyp.CELLLINE_META)
dc['PRIMARY_SITE'].value_counts()

LUNG                                  184
HAEMATOPOIETIC_AND_LYMPHOID_TISSUE    180
SKIN                                   62
LARGE_INTESTINE                        60
BREAST                                 59
CENTRAL_NERVOUS_SYSTEM                 55
OVARY                                  51
PANCREAS                               46
STOMACH                                38
KIDNEY                                 33
UPPER_AERODIGESTIVE_TRACT              33
BONE                                   29
ENDOMETRIUM                            28
URINARY_TRACT                          28
LIVER                                  28
OESOPHAGUS                             26
SOFT_TISSUE                            20
AUTONOMIC_GANGLIA                      17
THYROID                                12
PLEURA                                 11
PROSTATE                                8
BILIARY_TRACT                           8
SALIVARY_GLAND                          2
SMALL_INTESTINE                   

In [42]:
d[['COHORT', 'CANCER_TYPE']].drop_duplicates()

Unnamed: 0,COHORT,CANCER_TYPE
0,acc,Adrenocortical Carcinoma
0,blca,Bladder Cancer
0,brca,Breast Cancer
0,cellline,Breast Cancer
0,cesc,Cervical Cancer
0,chol,Hepatobiliary Cancer
0,coadread,Colorectal Cancer
0,dlbc,Non-Hodgkin Lymphoma
0,esca,Esophagogastric Cancer
0,gbm,Glioma
