In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

# Identifying Driver Mutations that Predict Prognosis in Squamous Cell Carcinomas Using TCGA Data

## Data Incubator Capstone, Summer 2017, Eric Jaehnig

### Introduction

The Cancer Genome Atlas (TCGA) contains data from genomic (i.e., mutations, copy number aberrations), transcriptomic (gene expression), and proteomic (protein levels) analysis of human cancer samples from 150 studies (see http://www.cbioportal.org/faq.jsp). Associated with these large biological datasets are various types of clinical data describing characteristics of the patients from which the samples were obtained. For this project, I propose to use mutation data, the most abundant data type in TCGA, to identify potential cancer driver mutations for squamous cell carcinomas. Of the 150 datasets deposited in TCGA, 11 include mutations identified from whole genome or whole exome sequencing of 5 different types of squamous cell cell carcinomas from 2200 cancer patients. My goal is to determine if there is a set of common genes that is mutated in squamous cell cancers and to identify the cellular processes and pathways that these genes regulate. Since these genes likely drive carcinogenesis in squamous epithelial cells, drugs that target the pathways they are involved in could potentially be used to treat squamous cell cancers in the future. While a quick Pubmed search reveals that specific squamous cell carcinoma or broad pan-cancer analysis of TCGA data is common, I only found one study focused on one gene that involved pan-cancer analysis specifically for squamous cell cancers, lending credence to the novelty of the proposed project.

#### Data sources and preliminary analysis

I downloaded the datasets for each of the 5 squamous cell carcinomas: cervical, esophageal, head and neck, lung, and skin. I loaded the “data_mutations_extended.txt” files for each of the 11 datasets into R dataframes. Each row in these files contains a uniquely identified mutation (multiple mutations per patient, each of which occupies a separate row). I selected only columns that may be of potential use when populating the dataframes, including gene names, the specific nucleotide changes, the corresponding amino acid changes, and the impact that these changes are predicted to have on the gene’s function. Since some mutations are predicted to have little or no effect on function, I focused on genes for which the predicted impact is “medium” or “high”. I then determined the set of genes that were mutated for each of the 5 squamous cell cancers. 

Since cancer is marked by genomic instability, which can lead to passenger mutations in several genes that don’t contribute carcinogenesis, I decided to focus on genes that were more likely to be driver mutations that promote cancer progression. To do this, I generated the Venn diagram shown in Plot 1. This diagram shows how the 5 sets of genes overlap. My rationale for focusing on the set of genes that was mutated in all of the squamous cell carcinomas has two components. The first is that passenger mutations are more likely to be random, while driver mutations are more likely to occur more consistently; thus, focusing on genes mutated in all squamous cell carcinomas is a strategy for weeding out passenger mutations. The second is that my goal is to identify the mutated genes that drive squamous cell carcinogenesis in general, not specific types of squamous cell carcinomas. I found that there were approximately 14,000 unique genes mutated in lung and head and neck cancer, ~12,000 in cervical cancer, ~4500 in esophageal cancer, and ~500 mutated in skin cancer. Of these genes, only 193 were mutated in all 5 cancers (center of Plot 1, download pdf here: https://agile-ocean-41073.herokuapp.com/).

Next, I evaluated this set of genes for cellular pathway (KEGG) enrichment because mutation of multiple genes in the same pathway suggests that the pathway plays an important role in carcinogenesis. 89 KEGG pathways showed statistically significant enrichment for genes mutated in all of the squamous cell cancers. I chose to analyze one of these, the second most significant pathway, more closely by plotting the mutated genes to pathway in Plot 2 (https://limitless-stream-36820.herokuapp.com/). The plot shows all of the genes in the pathway, EGFR tyrosine kinase inhibitor resistance, with the genes mutated in squamous cell cancers colored red. While most of these genes already have established roles in cancer, the fact that squamous cell cancers show such strong enrichment for mutations in genes involved in resistance to EGFR inhibitors suggests that these drugs may have limited efficacy in treating squamous cell carcinomas in the long-term. 

#### Proposed strategy for data analysis

More informative analysis of these genes is possible when considering the characteristics of the patient population, which is what I propose to do for the fellowship. Each dataset also contains a “data_clinical.txt” file that includes each patient’s age, sex, and disease status. Since the disease status field indicates whether the patient was disease-free or had cancer that either recurred or progressed when the studies were undertaken, machine learning approaches will be used to try classifying the set of mutated genes into genes associated with a good prognosis and into genes associated with poor prognosis. Genes that are associated with good prognosis may serve as biomarkers that could be used to predict response to treatment, while drugs that target genes associated with poor prognosis may provide improved outcomes for these patients. 

Finally, I will create a web-based app that will allow researchers to input their gene of interest and display the location of that gene in the Venn diagram above (Plot 1). The region of the Venn diagram in which the gene falls will be color coded to indicate that the gene was found to be mutated in the SCCs enclosing the region and whether it was found to be a predictor of prognosis (one color for good prognosis and another for poor prognosis). For example, FANCD2, is mutated in all of the SCC types except lung SCC, as can be seen at the end of this video: https://www.youtube.com/embed/p8C3DEN2Ah4



### Progress

Thus far, I used R to carry out the analysis. The wrangling and analysis was also performed solely on genomic data (mutated genes and specific mutations), while patient data will need to be added for the supervised machine learning studies. The R code has been deposited in the parent Github repository for this project (ejdna/capstone2/tcga.R). This R script also contains some basic unsupervised machine learning analysis (heirarchical and k-means clustering). This analysis was carried out at the level of the individual studies not at the level of individual patients, the latter being the more appropriate level and the focus of subsequent analysis. The dendrogram from the heirarchical clustering can be found here: ejdna/capstone2/Hclust_scc.jpeg, and pie charts showing the proportion of studies in different clusters from the k-means clustering for different values of k are generated as plots when the script is run (presuming all of the packages are properly updated, which may be a significant problem in itself). For subsequent analysis involving supervised machine learning, I will need to take advantage of the sklearn and TensorFlow packages in Python, which means 1) transferring the cleaned mutation data from R to Pandas and 2) transferring the patient data to Python and preparing it for the machine learning pipeline. 

#### Genomic mutation data

The following R script was used to write the cleaned dataframes from R to csv files: ejdna/capstone2/writeRdataframes2file.R (The <study>_Routput.csf files in this directory are the csv files generated by this script. These dataframes contain only columns that would be of potential interest for the proposed analysis and only rows containing mutations that are predicted to have moderate to high impact on the function of the mutated genes. The notations I used for the studies follows:
     
    Cervical SCC:      cervical  (unpublished TCGA data)
    Esophageal SCC:    esoph   (UCLA study, 2014)
                       esoph2  (ICGC study, 2014)
    Head and Neck SCC: hn   (Broad study, 2011)
                       hn2  (Hopkins study, 2011)
                       hn3  (MDA study, 2013)
                       hn4  (TCGA published data, 2015)
                       hn5  (unpublished TCGA data)
    Lung SCC:          lung  (TCGA published data, 2012)
                       lung2 (unpublished TCGA data)
    Skin SCC:          skin  (Dana Farber study, 2015)

Of these, only cervical, esoph2, hn3, hn4, hn5, lung2 and skin also have patient data files that contained the overall survival information needed for training and testing the supervised machine learning models. Thus, these are the datasets I will focus on going forward.


In [27]:
# load data from csv files into pandas
import pandas as pd
import numpy as np
import re

cervical_genomic_df = pd.read_csv("../cervical_Routput.csv")
cervical_genomic_df.head()

Unnamed: 0.1,Unnamed: 0,Gene,VariantClass,VariantType,SampleBC,MutStat,SeqSource,SampleID,DNAchange,ProteinChange,SIFT,PolyPhen,Impact
0,1,ABCA10,Missense_Mutation,SNP,TCGA-BI-A0VR-01,Somatic,WXS,dc62866c-0dc0-4ef5-bc63-fee1e943a99f,c.2823G>C,p.L941F,tolerated(0.47),benign(0.029),MODERATE
1,2,AIM1,Missense_Mutation,SNP,TCGA-BI-A0VR-01,Somatic,WXS,dc62866c-0dc0-4ef5-bc63-fee1e943a99f,c.1216G>C,p.E406Q,tolerated_low_confidence(0.1),benign(0.048),MODERATE
2,3,AK7,Missense_Mutation,SNP,TCGA-BI-A0VR-01,Somatic,WXS,dc62866c-0dc0-4ef5-bc63-fee1e943a99f,c.76G>T,p.D26Y,deleterious(0),probably_damaging(1),MODERATE
3,4,ANO9,Missense_Mutation,SNP,TCGA-BI-A0VR-01,Somatic,WXS,dc62866c-0dc0-4ef5-bc63-fee1e943a99f,c.292G>A,p.E98K,deleterious(0.02),benign(0.02),MODERATE
4,5,ARID4A,Missense_Mutation,SNP,TCGA-BI-A0VR-01,Somatic,WXS,dc62866c-0dc0-4ef5-bc63-fee1e943a99f,c.1525A>G,p.T509A,tolerated(0.26),benign(0.001),MODERATE


In [28]:
esoph2_genomic_df = pd.read_csv("../esoph2_Routput.csv")
esoph2_genomic_df.head()

Unnamed: 0.1,Unnamed: 0,Gene,VariantClass,VariantType,SampleBC,MutStat,SeqSource,SampleID,DNAchange,ProteinChange,SIFT,PolyPhen,Impact
0,1,CACNA1E,Missense_Mutation,SNP,ESCC-001T,Somatic,,,c.327G>T,p.E109D,deleterious(0.02),benign(0.217),MODERATE
1,2,SP140,Missense_Mutation,SNP,ESCC-001T,Somatic,,,c.934G>C,p.E312Q,deleterious(0.04),probably_damaging(0.974),MODERATE
2,3,PEX5L,Missense_Mutation,SNP,ESCC-001T,Somatic,,,c.64G>A,p.D22N,deleterious_low_confidence(0),benign(0.062),MODERATE
3,4,MATR3,Missense_Mutation,SNP,ESCC-001T,Somatic,,,c.1306G>A,p.E436K,tolerated(0.59),benign(0.013),MODERATE
4,5,ABCA13,Missense_Mutation,SNP,ESCC-001T,Somatic,,,c.847G>C,p.D283H,,probably_damaging(0.917),MODERATE


In [29]:
hn3_genomic_df = pd.read_csv("../hn3_Routput.csv")
hn3_genomic_df.head()

Unnamed: 0.1,Unnamed: 0,Gene,VariantClass,VariantType,SampleBC,MutStat,SeqSource,SampleID,DNAchange,ProteinChange,SIFT,PolyPhen,Impact
0,1,A1BG,Missense_Mutation,SNP,OSCJM-PT01-166-T,Somatic,Capture,,c.155A>G,p.H52R,tolerated(0.91),benign(0),MODERATE
1,2,ABCF3,Missense_Mutation,SNP,OSCJM-PT01-166-T,Somatic,Capture,,c.1123T>C,p.S375P,deleterious(0),possibly_damaging(0.866),MODERATE
2,3,ACAD10,Missense_Mutation,SNP,OSCJM-PT01-166-T,Somatic,Capture,,c.1259G>T,p.R420L,deleterious(0),benign(0.278),MODERATE
3,4,ACSF3,Missense_Mutation,SNP,OSCJM-PT01-166-T,Somatic,Capture,,c.1114G>A,p.V372M,tolerated(0.1),possibly_damaging(0.631),MODERATE
4,5,ADAM21,Missense_Mutation,SNP,OSCJM-PT01-166-T,Somatic,Capture,,c.214C>T,p.H72Y,tolerated(0.06),benign(0.107),MODERATE


In [30]:
hn4_genomic_df = pd.read_csv("../hn4_Routput.csv")
hn4_genomic_df.head()

Unnamed: 0.1,Unnamed: 0,Gene,VariantClass,VariantType,SampleBC,MutStat,SeqSource,SampleID,DNAchange,ProteinChange,SIFT,PolyPhen,Impact
0,1,SLC2A7,Missense_Mutation,SNP,TCGA-BA-4074-01-01-SM-1PNIE,Unknown,Unspecified,,c.17C>A,p.A6E,tolerated(1),benign(0),MODERATE
1,2,VPS13D,Missense_Mutation,SNP,TCGA-BA-4074-01-01-SM-1PNIE,Unknown,Unspecified,,c.10297A>T,p.I3433F,,benign(0.076),MODERATE
2,3,SZT2,Missense_Mutation,SNP,TCGA-BA-4074-01-01-SM-1PNIE,Unknown,Unspecified,,c.7898A>T,p.H2633L,deleterious(0.01),probably_damaging(0.997),MODERATE
3,4,LPHN2,Missense_Mutation,SNP,TCGA-BA-4074-01-01-SM-1PNIE,Unknown,Unspecified,,c.2777C>A,p.A926D,deleterious(0),benign(0.169),MODERATE
4,5,ARHGAP29,Missense_Mutation,SNP,TCGA-BA-4074-01-01-SM-1PNIE,Unknown,Unspecified,,c.2367G>C,p.W789C,tolerated(0.18),benign(0.006),MODERATE


In [31]:
hn5_genomic_df = pd.read_csv("../hn5_Routput.csv")
hn5_genomic_df.head()

Unnamed: 0.1,Unnamed: 0,Gene,VariantClass,VariantType,SampleBC,MutStat,SeqSource,SampleID,DNAchange,ProteinChange,SIFT,PolyPhen,Impact
0,1,ABCA13,Missense_Mutation,SNP,TCGA-4P-AA8J-01,Somatic,WXS,0fe5f8b2-c794-4c93-b2ab-0b544e366f5c,c.6639C>G,p.I2213M,[Not Available],possibly_damaging(0.453),MODERATE
1,2,ZMAT2,Missense_Mutation,SNP,TCGA-4P-AA8J-01,Somatic,WXS,0fe5f8b2-c794-4c93-b2ab-0b544e366f5c,c.363G>C,p.Q121H,deleterious(0),possibly_damaging(0.764),MODERATE
2,3,APPL2,Missense_Mutation,SNP,TCGA-4P-AA8J-01,Somatic,WXS,0fe5f8b2-c794-4c93-b2ab-0b544e366f5c,c.1261G>C,p.E421Q,tolerated(0.42),benign(0.011),MODERATE
3,4,LRSAM1,Missense_Mutation,SNP,TCGA-4P-AA8J-01,Somatic,WXS,0fe5f8b2-c794-4c93-b2ab-0b544e366f5c,c.825A>C,p.E275D,tolerated(0.51),benign(0.002),MODERATE
4,5,ZNF425,Missense_Mutation,SNP,TCGA-4P-AA8J-01,Somatic,WXS,0fe5f8b2-c794-4c93-b2ab-0b544e366f5c,c.276G>A,p.M92I,tolerated(0.07),benign(0),MODERATE


In [32]:
lung2_genomic_df = pd.read_csv("../lung2_Routput.csv")
lung2_genomic_df.head()

Unnamed: 0.1,Unnamed: 0,Gene,VariantClass,VariantType,SampleBC,MutStat,SeqSource,SampleID,DNAchange,ProteinChange,SIFT,PolyPhen,Impact
0,1,ANKFN1,Missense_Mutation,SNP,TCGA-18-3406-01,Somatic,Capture,d3320989-71fd-425b-933e-6e8528a016ed,c.840G>T,p.M280I,tolerated(0.4),benign(0.005),MODERATE
1,2,AGRN,Missense_Mutation,SNP,TCGA-18-3406-01,Somatic,Capture,d3320989-71fd-425b-933e-6e8528a016ed,c.2705C>T,p.A902V,tolerated(0.25),possibly_damaging(0.499),MODERATE
2,3,GLTPD1,Nonstop_Mutation,SNP,TCGA-18-3406-01,Somatic,Capture,d3320989-71fd-425b-933e-6e8528a016ed,c.645G>C,p.*215Yext*44,[Not Available],[Not Available],HIGH
3,4,ACTRT2,Missense_Mutation,SNP,TCGA-18-3406-01,Somatic,Capture,d3320989-71fd-425b-933e-6e8528a016ed,c.1095G>T,p.K365N,tolerated(0.16),benign(0.072),MODERATE
4,5,DCDC2B,Nonsense_Mutation,SNP,TCGA-18-3406-01,Somatic,Capture,d3320989-71fd-425b-933e-6e8528a016ed,c.689C>A,p.S230*,[Not Available],[Not Available],HIGH


In [33]:
skin_genomic_df = pd.read_csv("../skin_Routput.csv")
skin_genomic_df.head()

Unnamed: 0.1,Unnamed: 0,Gene,VariantClass,VariantType,SampleBC,MutStat,SeqSource,SampleID,DNAchange,ProteinChange,SIFT,PolyPhen,Impact
0,1,AFF3,Missense_Mutation,SNP,S00-28455-TP-NT,Somatic,WXS,S00-28455-TP,c.1082C>T,p.S361L,tolerated(0.12),benign(0.251),MODERATE
1,2,AGTR1,Missense_Mutation,SNP,S00-28455-TP-NT,Somatic,WXS,S00-28455-TP,c.377G>T,p.R126L,deleterious(0),probably_damaging(1),MODERATE
2,3,ARID2,Nonsense_Mutation,SNP,S00-28455-TP-NT,Somatic,WXS,S00-28455-TP,c.797G>A,p.W266*,,,HIGH
3,4,AXIN2,Missense_Mutation,SNP,S00-28455-TP-NT,Somatic,WXS,S00-28455-TP,c.290C>T,p.T97I,deleterious(0.01),benign(0.077),MODERATE
4,5,BRCA1,Missense_Mutation,SNP,S00-28455-TP-NT,Somatic,WXS,S00-28455-TP,c.3575C>T,p.P1192L,deleterious(0.02),benign(0.391),MODERATE


#### Patient Data

The patient data was also downloaded from TCGA and added to the notebook. The following code populates Pandas dataframes with the relevant data (chosen columns provide data that is in common to most or all of the datasets).

In [34]:
#Now get the correspending patient data for each dataset
#The patient data files are in the same directory as the R output files
cervical_patient_df = pd.read_table("../cervical_data_bcr_clinical_data_patient.txt", usecols = lambda x: x in [
                                                                                               "PATIENT_ID", 
                                                                                               "GENDER", 
                                                                                               "TOBACCO_SMOKING_HISTORY_INDICATOR", 
                                                                                               "SMOKING_YEAR_STARTED", 
                                                                                               "SMOKING_YEAR_STOPPED",
                                                                                               "SMOKING_PACK_YEARS",
                                                                                               "AGE",
                                                                                               "HISTOLOGICAL_DIAGNOSIS",
                                                                                               "GRADE",
                                                                                               "INITIAL_PATHOLOGIC_DX_YEAR", 
                                                                                               "AJCC_NODES_PATHOLOGIC_PN",
                                                                                               "AJCC_TUMOR_PATHOLOGIC_PT",
                                                                                               "AJCC_METASTASIS_PATHOLOGIC_PM",
                                                                                               "CLINICAL_STAGE", 
                                                                                               "TUMOR_TISSUE_SITE",
                                                                                               "OS_STATUS",
                                                                                               "OS_MONTHS",
                                                                                               "DFS_STATUS",
                                                                                               "DFS_MONTHS"],
                                   skiprows = 4)
cervical_patient_df.head()


Unnamed: 0,PATIENT_ID,GENDER,TOBACCO_SMOKING_HISTORY_INDICATOR,SMOKING_YEAR_STARTED,SMOKING_YEAR_STOPPED,SMOKING_PACK_YEARS,AGE,HISTOLOGICAL_DIAGNOSIS,GRADE,INITIAL_PATHOLOGIC_DX_YEAR,AJCC_NODES_PATHOLOGIC_PN,AJCC_TUMOR_PATHOLOGIC_PT,AJCC_METASTASIS_PATHOLOGIC_PM,CLINICAL_STAGE,TUMOR_TISSUE_SITE,OS_STATUS,OS_MONTHS,DFS_STATUS,DFS_MONTHS
0,TCGA-4J-AA1J,FEMALE,1,[Not Available],[Not Available],[Not Available],31,Cervical Squamous Cell Carcinoma,G3,2013,N0,T1b2,M0,Stage IB2,Cervical,LIVING,17.81,DiseaseFree,17.81
1,TCGA-BI-A0VR,FEMALE,4,22,1995,20,53,Cervical Squamous Cell Carcinoma,G3,2006,N1,T2b,M0,Stage IIIB,Cervical,LIVING,49.44,DiseaseFree,49.44
2,TCGA-BI-A0VS,FEMALE,2,[Not Available],[Not Available],[Not Available],48,Cervical Squamous Cell Carcinoma,G3,2007,N0,T1b1,M0,Stage IB,Cervical,LIVING,57.0,DiseaseFree,57.0
3,TCGA-BI-A20A,FEMALE,[Not Available],[Not Available],[Not Available],[Not Available],49,Cervical Squamous Cell Carcinoma,G3,2010,N0,T1b1,M0,Stage IB1,Cervical,LIVING,23.65,DiseaseFree,23.65
4,TCGA-C5-A0TN,FEMALE,1,[Not Available],[Not Available],[Not Available],21,Cervical Squamous Cell Carcinoma,G3,1997,N1,T1b,MX,Stage IB2,Cervical,DECEASED,11.43,Recurred/Progressed,2.04


In [35]:
esoph2_patient_df = pd.read_table("../esoph2_data_clinical.txt", usecols = lambda x: x in  ["SAMPLE_ID",
                                                                                            "PATIENT_ID", 
                                                                                            "GENDER", 
                                                                                            "AGE",
                                                                                            "OS_STATUS",
                                                                                            "OS_MONTHS",
                                                                                            "TUMOR_STAGE",
                                                                                            "TNM",
                                                                                            "SMOKER",
                                                                                            "CANCER_TYPE_DETAILED"],
                                   skiprows = 5)
esoph2_patient_df.head()

Unnamed: 0,SAMPLE_ID,PATIENT_ID,GENDER,AGE,OS_STATUS,OS_MONTHS,TUMOR_STAGE,TNM,SMOKER,CANCER_TYPE_DETAILED
0,ESCC-001T,ESCC-001T,Male,74,LIVING,24.61,IIIA,T2N2M0,No,Esophageal Squamous Cell Carcinoma
1,ESCC-002T,ESCC-002T,Male,47,DECEASED,7.79,IIB,T2N0M0,Yes,Esophageal Squamous Cell Carcinoma
2,ESCC-003T,ESCC-003T,Female,54,LIVING,45.37,IIB,T2N1M0,No,Esophageal Squamous Cell Carcinoma
3,ESCC-004T,ESCC-004T,Male,60,DECEASED,15.41,IIB,T2N0M0,No,Esophageal Squamous Cell Carcinoma
4,ESCC-005T,ESCC-005T,Male,65,DECEASED,29.37,IIB,T2N0M0,Yes,Esophageal Squamous Cell Carcinoma


In [42]:
hn3_patient_df = pd.read_table("../hn3_data_clinical.txt", usecols = lambda x: x in    ["SAMPLE_ID",
                                                                                        "PATIENT_ID", 
                                                                                        "SEX",
                                                                                        "TUMOR_STAGE",
                                                                                        "AGE",
                                                                                        "NODAL_STAGE",
                                                                                        "NUMBER_OF_LYMPHNODES_POSITIVE",
                                                                                        "SMOKER",
                                                                                        "TOBACCO_SMOKING_HISTORY_INDICATOR",
                                                                                        "OS_STATUS",
                                                                                        "OS_MONTHS", 
                                                                                        "DFS_STATUS",
                                                                                        "DFS_MONTHS",
                                                                                        "CANCER_TYPE_DETAILED"],
                                   skiprows = 5)
hn3_patient_df.head()

Unnamed: 0,SAMPLE_ID,PATIENT_ID,SEX,TUMOR_STAGE,AGE,NODAL_STAGE,NUMBER_OF_LYMPHNODES_POSITIVE,SMOKER,TOBACCO_SMOKING_HISTORY_INDICATOR,OS_STATUS,OS_MONTHS,DFS_STATUS,DFS_MONTHS,CANCER_TYPE_DETAILED
0,OSCJM-PT35-37-T,OSCJM-PT35-37,Male,T4,70,N2b,7,Yes,Yes,DECEASED,6.9,Recurred/Progressed,6.3,Head and Neck Squamous Cell Carcinoma
1,OSCJM-PT07-43-T,OSCJM-PT07-43,Male,T4,50,N2b,2,Yes,Yes,LIVING,51.1,Diseasefree,48.6,Head and Neck Squamous Cell Carcinoma
2,OSCJM-PT05-91-T,OSCJM-PT05-91,Male,T3,41,N1,1,Yes,Yes,LIVING,110.8,Diseasefree,1.4,Head and Neck Squamous Cell Carcinoma
3,OSCJM-PTC-139-T,OSCJM-PTC-139,Male,T3,78,N1,1,No,No,DECEASED,10.8,Recurred/Progressed,9.7,Head and Neck Squamous Cell Carcinoma
4,OSCJM-PT01-166-T,OSCJM-PT01-166,Male,T3,63,N0,0,Yes,Yes,LIVING,40.9,Diseasefree,39.0,Head and Neck Squamous Cell Carcinoma


In [43]:
hn4_patient_df = pd.read_table("../hn4_data_clinical.txt", usecols = lambda x: x in    ["SAMPLE_ID",
                                                                                        "PATIENT_ID",
                                                                                        "OS_STATUS",
                                                                                        "OS_MONTHS", 
                                                                                        "CANCER_TYPE_DETAILED"],
                                   skiprows = 5)
hn4_patient_df.head()

Unnamed: 0,SAMPLE_ID,PATIENT_ID,OS_STATUS,OS_MONTHS,CANCER_TYPE_DETAILED
0,TCGA-BA-4074-01,TCGA-BA-4074-01,DECEASED,15.15,Head and Neck Squamous Cell Carcinoma
1,TCGA-BA-4076-01,TCGA-BA-4076-01,DECEASED,13.63,Head and Neck Squamous Cell Carcinoma
2,TCGA-BA-4078-01,TCGA-BA-4078-01,DECEASED,9.07,Head and Neck Squamous Cell Carcinoma
3,TCGA-BA-5149-01,TCGA-BA-5149-01,LIVING,8.15,Head and Neck Squamous Cell Carcinoma
4,TCGA-BA-5151-01,TCGA-BA-5151-01,LIVING,6.24,Head and Neck Squamous Cell Carcinoma


In [48]:
hn5_patient_df = pd.read_table("../hn5_data_bcr_clinical_data_patient.txt", usecols = lambda x: x in ["PATIENT_ID",  
                                                                                                      "HISTOLOGICAL_DIAGNOSIS",
                                                                                                      "GENDER",
                                                                                                      "LYMPH_NODES_EXAMINED_HE_COUNT",
                                                                                                      "AJCC_TUMOR_PATHOLOGIC_PT",
                                                                                                      "AJCC_NODES_PATHOLOGIC_PN",
                                                                                                      "AJCC_METASTASIS_PATHOLOGIC_PM",
                                                                                                      "AJCC_PATHOLOGIC_TUMOR_STAGE",
                                                                                                      "GRADE",
                                                                                                      "TOBACCO_SMOKING_HISTORY_INDICATOR", 
                                                                                                      "SMOKING_YEAR_STARTED", 
                                                                                                      "SMOKING_YEAR_STOPPED",
                                                                                                      "SMOKING_PACK_YEARS",
                                                                                                      "AGE",
                                                                                                      "CLIN_M_STAGE",
                                                                                                      "CLIN_N_STAGE",
                                                                                                      "CLIN_T_STAGE",
                                                                                                      "CLINICAL_STAGE", 
                                                                                                      "TUMOR_TISSUE_SITE",
                                                                                                      "OS_STATUS",
                                                                                                      "OS_MONTHS",
                                                                                                      "DFS_STATUS",
                                                                                                      "DFS_MONTHS"],
                            skiprows = 4)
hn5_patient_df.head()

Unnamed: 0,PATIENT_ID,HISTOLOGICAL_DIAGNOSIS,GENDER,LYMPH_NODES_EXAMINED_HE_COUNT,AJCC_TUMOR_PATHOLOGIC_PT,AJCC_NODES_PATHOLOGIC_PN,AJCC_METASTASIS_PATHOLOGIC_PM,AJCC_PATHOLOGIC_TUMOR_STAGE,GRADE,TOBACCO_SMOKING_HISTORY_INDICATOR,...,AGE,CLIN_M_STAGE,CLIN_N_STAGE,CLIN_T_STAGE,CLINICAL_STAGE,TUMOR_TISSUE_SITE,OS_STATUS,OS_MONTHS,DFS_STATUS,DFS_MONTHS
0,TCGA-BA-4074,Head & Neck Squamous Cell Carcinoma,MALE,5,T2,N2c,M0,Stage IVA,G3,2,...,69,M0,N2c,T3,Stage IVA,Head and Neck,DECEASED,15.18,Recurred/Progressed,13.01
1,TCGA-BA-4075,Head & Neck Squamous Cell Carcinoma,MALE,0,T3,N0,M0,Stage III,G2,2,...,49,M0,N1,T4a,Stage IVA,Head and Neck,DECEASED,9.3,Recurred/Progressed,7.75
2,TCGA-BA-4076,Head & Neck Squamous Cell Carcinoma,MALE,[Not Available],TX,NX,[Not Available],[Not Available],G2,2,...,39,M0,N2c,T3,Stage IVA,Head and Neck,DECEASED,13.63,Recurred/Progressed,9.40
3,TCGA-BA-4077,Head & Neck Squamous Cell Carcinoma,FEMALE,[Not Available],T4a,N0,M0,Stage IVA,G2,4,...,45,M0,N3,T4b,Stage IVB,Head and Neck,DECEASED,37.25,[Not Available],[Not Available]
4,TCGA-BA-4078,Head & Neck Squamous Cell Carcinoma,MALE,4,[Not Available],[Not Available],[Not Available],[Not Available],G2,4,...,83,M0,N2a,T2,Stage IVA,Head and Neck,DECEASED,9.07,[Not Available],[Not Available]


In [49]:
lung2_patient_df = pd.read_table("../lung2_data_bcr_clinical_data_patient.txt", usecols = lambda x: x in ["PATIENT_ID",  
                                                                                                          "HISTOLOGICAL_DIAGNOSIS",
                                                                                                          "GENDER",
                                                                                                          "INITIAL_PATHOLOGIC_DX_YEAR",
                                                                                                          "AJCC_TUMOR_PATHOLOGIC_PT",
                                                                                                          "AJCC_NODES_PATHOLOGIC_PN",
                                                                                                          "AJCC_METASTASIS_PATHOLOGIC_PM",
                                                                                                          "AJCC_PATHOLOGIC_TUMOR_STAGE",
                                                                                                          "TOBACCO_SMOKING_HISTORY_INDICATOR", 
                                                                                                          "SMOKING_YEAR_STARTED", 
                                                                                                          "SMOKING_YEAR_STOPPED",
                                                                                                          "SMOKING_PACK_YEARS",
                                                                                                          "GRADE",
                                                                                                          "AGE",
                                                                                                          "TUMOR_TISSUE_SITE",
                                                                                                          "OS_STATUS",
                                                                                                          "OS_MONTHS",
                                                                                                          "DFS_STATUS",
                                                                                                          "DFS_MONTHS"],
                            skiprows = 4)
lung2_patient_df.head()

Unnamed: 0,PATIENT_ID,HISTOLOGICAL_DIAGNOSIS,GENDER,INITIAL_PATHOLOGIC_DX_YEAR,AJCC_TUMOR_PATHOLOGIC_PT,AJCC_NODES_PATHOLOGIC_PN,AJCC_METASTASIS_PATHOLOGIC_PM,AJCC_PATHOLOGIC_TUMOR_STAGE,TOBACCO_SMOKING_HISTORY_INDICATOR,SMOKING_YEAR_STARTED,SMOKING_YEAR_STOPPED,SMOKING_PACK_YEARS,AGE,TUMOR_TISSUE_SITE,OS_STATUS,OS_MONTHS,DFS_STATUS,DFS_MONTHS
0,TCGA-18-3407,Lung Squamous Cell Carcinoma,MALE,2003,T2,N0,M0,Stage IB,3,[Not Available],1988,40,72,Lung,DECEASED,4.47,[Not Available],[Not Available]
1,TCGA-18-3408,Lung Squamous Cell Carcinoma,FEMALE,2004,T2,N0,M0,Stage IB,4,[Not Available],2004,30,77,Lung,DECEASED,75.69,Recurred/Progressed,58.90
2,TCGA-18-3409,Lung Squamous Cell Carcinoma,MALE,2004,T1,N0,M0,Stage IA,3,[Not Available],1974,20,74,Lung,LIVING,123.09,Recurred/Progressed,75.26
3,TCGA-18-3410,Lung Squamous Cell Carcinoma,MALE,2004,T3,N0,M0,Stage IIB,3,[Not Available],[Not Available],[Not Available],81,Lung,DECEASED,4.8,[Not Available],[Not Available]
4,TCGA-18-3411,Lung Squamous Cell Carcinoma,FEMALE,2005,T2,N2,M0,Stage IIIA,2,[Not Available],[Not Available],50,63,Lung,LIVING,117.48,DiseaseFree,117.48


In [52]:
skin_patient_df = pd.read_table("../skin_patient_data_clinical.txt", usecols = lambda x: x in  ["SAMPLE_ID",
                                                                                        "PATIENT_ID",
                                                                                        "AGE",
                                                                                        "GENDER", 
                                                                                        "SMOKER",
                                                                                        "DFS_STATUS",
                                                                                        "OS_STATUS",
                                                                                        "PFS_MONTHS"
                                                                                        "OS_MONTHS",
                                                                                        "CANCER_TYPE_DETAILED"],
                                   skiprows = 5)
skin_patient_df.head()

Unnamed: 0,SAMPLE_ID,PATIENT_ID,AGE,GENDER,SMOKER,DFS_STATUS,OS_STATUS,CANCER_TYPE_DETAILED
0,S00-28455-TP-NT,S00-28455,73,Female,No,Recurred/Progressed,LIVING,Cutaneous Squamous Cell Carcinoma
1,S00-35182-TP-NT,S00-35182,84,Male,Yes,DiseaseFree,DECEASED,Cutaneous Squamous Cell Carcinoma
2,S01-27743-TP-NT,S01-27743,72,Female,No,Recurred/Progressed,LIVING,Cutaneous Squamous Cell Carcinoma
3,S02-14875,S02-14785,48,Male,No,DiseaseFree,LIVING,Cutaneous Squamous Cell Carcinoma
4,S02-15015-TP-S00-35182-NT,S02-15015,75,Male,Yes,Recurred/Progressed,LIVING,Cutaneous Squamous Cell Carcinoma


#### Prepping the data for the machine learning pipeline

The previous data analysis done in R was done using mutation data that was flattened across the study. That is, for each study, a column was provided in the dataframe for each of the genes mutated in all of the datasets and a value of 1, if the mutation was identified in the study, or 0, if not, was assigned to that gene for that study. However, to employ machine learning, I need to flatten over each patient in each study since each patient has a different clinical outcome, which is what I am trying to predict. 

The following code flattens each genomic dataframe as follows:

patient_ID (this needs to be extracted from barcode ID) | gene1_impact(0 for not mutated, 1 for moderate, and 2 for high) | gene2_impact | ...

Thus, the mutation status of each gene is a feature in the matrix used for machine learning.

In [None]:
#first extract patient ID from barcode id 

