# Identifying Driver Mutations that Predict Prognosis in Squamous Cell Carcinomas Using TCGA Data

### Data Incubator Capstone, Summer 2017, Eric Jaehnig

The Cancer Genome Atlas (TCGA) contains data from genomic (i.e., mutations, copy number aberrations), transcriptomic (gene expression), and proteomic (protein levels) analyses of human cancer samples from 150 studies (see http://www.cbioportal.org/faq.jsp). Associated with these large biological datasets are various types of clinical data describing characteristics of the patients from whom the samples were obtained. For this project, I propose to use mutation data, the most abundant data type in TCGA, to identify potential cancer driver mutations for squamous cell carcinomas. Of the 150 datasets deposited in TCGA, 11 include mutations identified from whole genome or whole exome sequencing of 5 different types of squamous cell carcinomas from 2200 cancer patients. My goal is to determine whether there is a common set of genes that is mutated in squamous cell cancers and to identify the cellular processes and pathways that these genes regulate. Since these genes likely drive carcinogenesis in squamous epithelial cells, drugs that target the pathways they are involved in could potentially be used to treat squamous cell cancers in the future. While a quick PubMed search reveals that analyses of TCGA data for individual squamous cell carcinomas and broad pan-cancer analyses are both common, I found only one study, focused on a single gene, that involved a pan-cancer analysis specifically of squamous cell cancers, which supports the novelty of the proposed project.

I downloaded the datasets for each of the 5 squamous cell carcinomas: cervical, esophageal, head and neck, lung, and skin. I loaded the “data_mutations_extended.txt” files for each of the 11 datasets into R dataframes. Each row in these files describes one uniquely identified mutation; a patient with multiple mutations has one row per mutation. When populating the dataframes, I kept only the columns likely to be useful, including gene names, the specific nucleotide changes, the corresponding amino acid changes, and the impact that these changes are predicted to have on the gene’s function. Since some mutations are predicted to have little or no effect on function, I focused on genes carrying mutations whose predicted impact is “medium” or “high”. I then determined the set of genes that were mutated in each of the 5 squamous cell cancers.
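
The filtering step described above can be sketched as follows. This is an illustrative Python version, not the R code used for the original analysis. The column names `Hugo_Symbol` and `MA:FImpact` are assumptions about the file header and may need to be changed, and the rows in `DEMO_TEXT` are made-up placeholders rather than real TCGA records.

```python
import csv
import io

def mutated_genes(handle, gene_col="Hugo_Symbol", impact_col="MA:FImpact",
                  keep=("medium", "high")):
    """Return the set of genes with at least one mutation whose impact is in `keep`.

    The default column names are assumptions; adjust them to the real header.
    """
    # Skip '#' comment lines that some mutation files carry above the header.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    genes = set()
    for row in reader:
        impact = (row.get(impact_col) or "").strip().lower()
        if impact in keep:
            genes.add(row[gene_col])
    return genes

# Made-up rows standing in for one study's mutation file.
DEMO_TEXT = (
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tHGVSp_Short\tMA:FImpact\n"
    "GENE_A\tPATIENT_1\tp.R100H\thigh\n"
    "GENE_A\tPATIENT_2\tp.G12D\tlow\n"
    "GENE_B\tPATIENT_1\tp.A5A\tneutral\n"
    "GENE_C\tPATIENT_3\tp.E545K\tmedium\n"
)

demo_genes = mutated_genes(io.StringIO(DEMO_TEXT))
print(sorted(demo_genes))
```

For a file on disk, the same function could be called as `mutated_genes(open(path, newline=""))`. The union of the resulting sets over all studies of one cancer type would give that cancer's mutated gene set.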

Since cancer is marked by genomic instability, which can lead to passenger mutations in many genes that don’t contribute to carcinogenesis, I decided to focus on genes that were more likely to carry driver mutations that promote cancer progression. To do this, I generated the Venn diagram shown in Plot 1, which shows how the 5 sets of genes overlap. My rationale for focusing on the set of genes that was mutated in all of the squamous cell carcinomas has two components. The first is that passenger mutations are more likely to be random, while driver mutations are more likely to recur consistently; thus, focusing on genes mutated in all squamous cell carcinomas is a strategy for weeding out passenger mutations. The second is that my goal is to identify the mutated genes that drive squamous cell carcinogenesis in general, not specific types of squamous cell carcinomas. I found that there were approximately 14,000 unique genes mutated in lung and head and neck cancer, ~12,000 in cervical cancer, ~4,500 in esophageal cancer, and ~500 in skin cancer. Of these genes, only 193 were mutated in all 5 cancers (center of Plot 1; a PDF can be downloaded here: https://agile-ocean-41073.herokuapp.com/).
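
The overlap logic behind the Venn diagram reduces to set operations. Below is a minimal Python sketch, assuming each cancer type's mutated genes are already available as a set. The function names and the toy gene sets are hypothetical and only illustrate the computation; they do not reproduce the counts reported above.

```python
from collections import Counter

def shared_by_all(gene_sets):
    """Genes present in every cancer type's set (the center of the Venn diagram)."""
    return set.intersection(*(set(genes) for genes in gene_sets.values()))

def venn_region_counts(gene_sets):
    """Count genes per exclusive Venn region, keyed by the frozenset of cancer types."""
    membership = {}
    for cancer, genes in gene_sets.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(cancer)
    return Counter(frozenset(cancers) for cancers in membership.values())

# Placeholder gene sets, one per squamous cell carcinoma type.
toy_sets = {
    "cervical": {"GENE_A", "GENE_B", "GENE_C"},
    "esophageal": {"GENE_A", "GENE_C"},
    "head and neck": {"GENE_A", "GENE_B", "GENE_D"},
    "lung": {"GENE_A", "GENE_D"},
    "skin": {"GENE_A"},
}

core = shared_by_all(toy_sets)
regions = venn_region_counts(toy_sets)
print(sorted(core))
for cancers, count in regions.items():
    print(sorted(cancers), count)
```

Each key of `regions` corresponds to one exclusive area of a 5-set Venn diagram, so the counts could be passed directly to a plotting routine.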

Next, I evaluated this set of genes for cellular pathway (KEGG) enrichment because mutation of multiple genes in the same pathway suggests that the pathway plays an important role in carcinogenesis. In total, 89 KEGG pathways showed statistically significant enrichment for genes mutated in all of the squamous cell cancers. I chose to analyze one of these, the second most significant pathway, more closely by mapping the mutated genes onto the pathway in Plot 2 (https://limitless-stream-36820.herokuapp.com/). The plot shows all of the genes in the pathway, EGFR tyrosine kinase inhibitor resistance, with the genes mutated in squamous cell cancers colored red. While most of these genes already have established roles in cancer, the fact that squamous cell cancers show such strong enrichment for mutations in genes involved in resistance to EGFR inhibitors suggests that these drugs may have limited efficacy in treating squamous cell carcinomas in the long term.
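
The text does not say which statistical test was used for the enrichment analysis. A common choice for pathway over-representation is the one-sided hypergeometric test, sketched below in Python as an assumption rather than as the method actually applied. The function name and the numbers in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(population, pathway_size, selected, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    population   -- number of genes in the background
    pathway_size -- background genes annotated to the pathway
    selected     -- size of the gene list being tested
    overlap      -- genes in the list that belong to the pathway
    """
    upper = min(pathway_size, selected)
    total = comb(population, selected)
    tail = sum(
        comb(pathway_size, i) * comb(population - pathway_size, selected - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

# Toy numbers: 20 background genes, 5 in the pathway, 4 genes selected,
# 2 of which fall in the pathway.
p_value = hypergeom_enrichment_p(20, 5, 4, 2)
print(round(p_value, 4))
```

When many pathways are tested at once, the resulting p-values would normally be adjusted for multiple testing before any of them is called significant.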

A more informative analysis of these genes is possible when the characteristics of the patient population are taken into account, which is what I propose to do for the fellowship. Each dataset also contains a “data_clinical.txt” file that includes each patient’s age, sex, and disease status. Since the disease status field indicates whether the patient was disease-free or had cancer that recurred or progressed when the studies were undertaken, I will use machine learning approaches to try to classify the mutated genes into those associated with a good prognosis and those associated with a poor prognosis. Genes that are associated with a good prognosis may serve as biomarkers that could be used to predict response to treatment, while drugs that target genes associated with a poor prognosis may provide improved outcomes for these patients.
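
Before fitting any machine learning model, a simple per-gene baseline can show whether mutation status tracks disease status at all. The Python sketch below is such a baseline, not the proposed machine learning approach: it builds a 2x2 table for one gene and computes an odds ratio with a 0.5 continuity correction. The function names, the patient identifiers, and the data are all hypothetical.

```python
def mutation_outcome_table(gene, mutated_by_patient, recurred_by_patient):
    """Return (a, b, c, d) for one gene.

    a -- mutated and recurred/progressed
    b -- mutated and disease-free
    c -- not mutated and recurred/progressed
    d -- not mutated and disease-free
    """
    a = b = c = d = 0
    for patient, recurred in recurred_by_patient.items():
        mutated = gene in mutated_by_patient.get(patient, set())
        if mutated and recurred:
            a += 1
        elif mutated:
            b += 1
        elif recurred:
            c += 1
        else:
            d += 1
    return a, b, c, d

def smoothed_odds_ratio(a, b, c, d):
    """Odds ratio with 0.5 added to every cell so that zero counts do not divide by zero."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

# Made-up cohort: True means the cancer recurred or progressed.
recurred = {"P1": True, "P2": True, "P3": True, "P4": False, "P5": False, "P6": False}
mutations = {"P1": {"GENE_X"}, "P2": {"GENE_X"}, "P4": {"GENE_X"}, "P5": {"GENE_Y"}}

table = mutation_outcome_table("GENE_X", mutations, recurred)
odds = smoothed_odds_ratio(*table)
print(table, round(odds, 3))
```

An odds ratio above 1 would point toward an association with recurrence or progression (poor prognosis), and one below 1 toward disease-free status, although any such reading would need a significance test and a far larger cohort than this toy example.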

Finally, I will create a web-based app that will allow researchers to enter a gene of interest and see where that gene falls in the Venn diagram above (Plot 1). The region of the Venn diagram in which the gene falls will be color-coded to indicate which SCCs the gene was found to be mutated in and whether it was found to be a predictor of prognosis (one color for good prognosis and another for poor prognosis). For example, FANCD2 is mutated in all of the SCC types except lung SCC, as can be seen at the end of this video: https://www.youtube.com/embed/p8C3DEN2Ah4
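
The lookup at the heart of such an app could look like the Python sketch below. It is only an illustration of the idea: the function name, the color choices, and the placeholder gene are assumptions. The FANCD2 membership follows the example in the text, and no prognosis label is assigned to it because that analysis has not been run.

```python
def venn_lookup(gene, gene_sets, prognosis=None, colors=None):
    """Describe the Venn region a gene falls in and the color to draw it with."""
    # Hypothetical color scheme; None means no prognosis call is available.
    colors = colors or {"good": "green", "poor": "red"}
    region = sorted(cancer for cancer, genes in gene_sets.items() if gene in genes)
    label = (prognosis or {}).get(gene)
    return {
        "gene": gene,
        "region": region,
        "prognosis": label,
        "color": colors.get(label, "grey"),
    }

# FANCD2 membership as stated in the text; GENE_Z is a placeholder.
sets_by_cancer = {
    "cervical": {"FANCD2", "GENE_Z"},
    "esophageal": {"FANCD2"},
    "head and neck": {"FANCD2", "GENE_Z"},
    "lung": {"GENE_Z"},
    "skin": {"FANCD2"},
}

result = venn_lookup("FANCD2", sets_by_cancer)
print(result)
```

A web front end would only need to pass the user's gene symbol to this function and highlight the returned region in the chosen color.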

