## Module_3: *Cancer*

## Team Members:
Gabby Holohan & Meredith Lineweaver

## Project Title:
Impact of E-Cadherin gene on Breast Cancer Prognosis



## Project Goal:
This project seeks to... *(what is the purpose of your project -- i.e., describe the question that you seek to answer by analyzing data.)*

## Disease Background:
*Pick a hallmark to focus on, and figure out what genes you are interested in researching based on that decision. Then fill out the information below.*

* Cancer hallmark focus: Activating Invasion and Metastasis
* Overview of hallmark:
Invasion and metastasis proceed through a series of steps: local invasion, entry of cancer cells into nearby blood and lymphatic vessels, escape of cancer cells from those vessels into distant tissues, and formation of new tumors. In carcinoma cells, E-cadherin, a molecule that helps assemble epithelial cell sheets, is often decreased, which allows cancer cells to migrate and the cancer to metastasize. Transcription factors such as Snail, Slug, Twist, and Zeb1/2 help regulate cell migration, and some are directly involved in regulating E-cadherin gene expression. The exact interactions between these transcription factors and E-cadherin expression remain largely unknown. The general consensus, however, is clear: increasing expression of molecules like E-cadherin that aid the assembly of epithelial sheets decreases the ability of a cancer to metastasize, while decreasing their expression enhances the cancer's ability to spread.
* Genes associated with hallmark to be studied (describe the role of each gene, signaling pathway, or gene set you are going to investigate): 
    - The CDH1 gene encodes E-cadherin, a protein that allows epithelial cells to adhere to one another. Deficiency in E-cadherin promotes epithelial-mesenchymal transition (EMT), a process in which cells lose their polarity and their ability to adhere to one another.
    https://www.sciencedirect.com/science/article/abs/pii/S0924224424000748?via%3Dihub
    - Mutations, proteolytic cleavage, chromosomal deletions, epigenetic regulation, and transcriptional silencing of the CDH1 promoter may limit the functionality of E-cadherin, especially in gastric, breast, liver, pancreatic, and skin cancer.
    - Loss of E-cadherin activates EMT transcription factors, which promotes invasion and metastasis.
    - E-cadherin regulates receptor tyrosine kinase (RTK) and the tyrosine kinase Src.
    - E-cadherin-expressing cell lines with increased nuclear factor kappa-light-chain-enhancer of activated B cells (NF-κB) activity and increased c-Myc expression show greater cell proliferation, along with higher adenosine triphosphate (ATP) production from increased rates of glycolysis and oxidative phosphorylation.
    - N-cadherin is another protein in the cadherin family; at increased levels it contributes to EMT. EMT has been found to inhibit apoptosis mediated by death receptor 4 (TNF-related apoptosis-inducing ligand receptor 1, TRAIL-R1) and/or death receptor 5 (TRAIL-R2).
    https://pmc.ncbi.nlm.nih.gov/articles/PMC6830116/
    - CDH2 is the gene that encodes N-cadherin; it is often implicated in neurocognitive disease.
    - Elevated expression of CDH2 is tied to decreased expression of CDH1 and the development of EMT.
    https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2022.972059/full

*Will you be focusing on a single cancer type or looking across cancer types? Depending on your decision, update this section to include relevant information about the disease at the appropriate level of detail. Regardless, each bullet point should be filled in. If you are looking at multiple cancer types, you should investigate differences between the types (e.g. what is the most prevalent cancer type? What type has the highest mortality rate?) and similarities (e.g. what sorts of treatments exist across the board for cancer patients? what is common to all cancers in terms of biological mechanisms?). Note that this is a smaller list than the initial 11 in Module 1.*

* Prevalence & incidence
    - Breast cancer is the second most common cancer among women in the US, the second leading cause of cancer death among women, and the leading cause of cancer death for Hispanic and Black women
    https://www.cdc.gov/breast-cancer/statistics/index.html
    - 279,731 new breast cancers in women in the US were reported in 2022, and 42,213 women in the US died from breast cancer in 2023
    - The annual incidence in the US is 132.9 per 100,000
    https://www.cdc.gov/united-states-cancer-statistics/publications/breast-cancer-stat-bite.html
* Risk factors (genetic, lifestyle) & Societal determinants
    - being female
    - older age
    - family history
    - personal history of cancer
    - dense breasts
    - physical inactivity
    - being overweight or obese
    - alcohol consumption
    - hormone use/oral contraceptive use
    - societal determinants: poverty, lack of education, unemployment, lower health literacy, lack of health insurance, living in disadvantaged neighborhoods, housing and food insecurity, delayed childbearing
    - genetic mutations in BRCA1, BRCA2, ATM, BARD1, BRIP1, CHEK2, CDH1
    https://www.cdc.gov/breast-cancer/risk-factors/index.html
* Standard of care treatments (& reimbursement)
    - surgery: mastectomies and lumpectomies
    - radiation therapy
    - chemotherapy
    - targeted therapy drugs
    - hormone therapy
    - reimbursement depends on insurance type (ex. private, Medicare, or Medicaid)
    https://www.breastcancer.org/managing-life/covering-cost-of-care/cost-of-care-report
    https://www.cancer.gov/types/breast/hp/breast-treatment-pdq
* Biological mechanisms (anatomy, organ physiology, cell & molecular physiology)
    - most breast cancers originate in the epithelial cells lining the ducts or the lobules
    - genetic mutations accumulate in the breast epithelial cells: spontaneous or inherited mutations
    - chromosomal rearrangements can activate oncogenes or silence tumor suppressor genes; these rearrangements can be triggered by estrogen
    - hormone receptor-positive breast cancers grow in response to estrogen or progesterone; breast cancers can also overexpress HER2
    - additional mutations can lead to invasion, angiogenesis, and metastasis
    https://link.springer.com/chapter/10.1007/978-3-319-21683-6_9


## Data-Set: 

*Once you decide on the subset of data you want to use (i.e. only 1 cancer type or many; any clinical features needed?; which genes will you look at?) describe the dataset. There are a ton of clinical features, so you don't need to describe them all, only the ones pertinent to your question.*



The data analyzed for this project come from a paper published in Bioinformatics titled "Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results". The authors reprocessed RNA-Sequencing data for 9264 tumor and 741 normal samples across 24 cancer types using the Rsubread package and compiled the matching clinical data. They compared the Rsubread output with the same TCGA samples processed by the original TCGA pipeline and determined that the Rsubread pipeline produced fewer errors and more consistent expression levels. For this project we used only the data for the CDH1 gene.
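As a rough illustration of restricting the dataset to CDH1, the sketch below pulls one gene's expression values for one cancer type out of a genes-by-samples matrix. It is only a sketch under stated assumptions: the toy matrix, the sample IDs, the `cancer_type` labels, and the helper `select_gene` are all hypothetical and are not part of the GSE62944 files.

```python
import pandas as pd

# Toy stand-in for an expression matrix: genes as rows, samples as columns.
# The values and sample IDs below are invented for illustration only.
expr_df = pd.DataFrame(
    {
        "SAMPLE_A": [5.1, 0.2, 7.3],
        "SAMPLE_B": [1.4, 6.8, 7.0],
        "SAMPLE_C": [4.9, 0.5, 6.9],
    },
    index=["CDH1", "CDH2", "ACTB"],
)

# Hypothetical cancer-type labels keyed by the same sample IDs.
cancer_type = pd.Series(
    {"SAMPLE_A": "BRCA", "SAMPLE_B": "LUAD", "SAMPLE_C": "BRCA"},
    name="cancer_type",
)


def select_gene(expr, labels, gene, cancer):
    """Return one gene's expression for the samples of a single cancer type."""
    # Sample IDs whose label matches the requested cancer type.
    keep = labels.index[labels == cancer]
    # Keep only samples that are actually present in the expression matrix.
    cols = expr.columns.intersection(keep)
    # Selecting one row label and several column labels yields a Series.
    return expr.loc[gene, cols].rename(gene)


cdh1_brca = select_gene(expr_df, cancer_type, "CDH1", "BRCA")
print(cdh1_brca)
```

The same helper could be called with `"CDH2"` to compare N-cadherin expression in the same samples.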


## Data Analysis: 

### Methods
The machine learning technique I am using is: *fill in and describe*

*What is this method optimizing? How does the model decide it is "good enough"?*


### Analysis
*(Describe how you analyzed the data. This is where you should intersperse your Python code so that anyone reading this can run your code to perform the analysis that you did, generate your figures, etc.)*


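# %%
import pandas as pd

# Clinical variables for the GSE62944 samples (tab-separated text).
clin_df = pd.read_csv(
    '../Data/raw/GSE62944_06_01_15_TCGA_24_548_Clinical_Variables_9264_Samples.txt', index_col=0, header=0, sep='\t')

# Transpose so that each row is a sample and each column is a clinical variable.
clin_df = clin_df.transpose()
print(clin_df.head())
# %%
# Survival metadata from the 'TCGA-CDR' sheet; index_col=1 uses the sheet's
# second column as the index, which is looked up by patient barcode below.
survival_data = pd.read_excel(
    "../data/Metadata_with_survival.xlsx", index_col=1, header=0, sheet_name='TCGA-CDR')
print(survival_data.head())

# %%
# Per-sample metadata, including each sample's bcr_patient_barcode.
data = pd.read_csv('../Data/GSE62944_metadata.csv',
                   index_col=0, header=0)
print(data.head())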
# %%
# Build one survival row per sample, in the same order as `data`.
subset_survival_data = pd.DataFrame(columns=survival_data.columns)
for i, r in data.iterrows():
    barcode = r['bcr_patient_barcode']
    if barcode in survival_data.index:
        # Copy the matching survival row and relabel it with the sample ID.
        new_row_df = pd.DataFrame(
            [survival_data.loc[barcode]])  # wrap the row (a Series) in a list
        new_row_df.index = [i]
        subset_survival_data = pd.concat(
            [subset_survival_data, new_row_df], ignore_index=False)
    else:
        print(f"Barcode {barcode} (sample {i}) not found in survival data.")
        # add a row of NaNs so every sample keeps a row
        new_row_df = pd.DataFrame(
            [pd.Series({col: pd.NA for col in survival_data.columns}, name=i)])
        subset_survival_data = pd.concat(
            [subset_survival_data, new_row_df], ignore_index=False)

# Drop the leftover "Unnamed: 0" column read in from the spreadsheet.
subset_survival_data = subset_survival_data.drop("Unnamed: 0", axis=1)

subset_survival_data.to_csv('../Data/subsampled_TCGA_CDR_metadata.csv')
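
The row-by-row loop above can be slow on thousands of samples because each `pd.concat` copies the growing table. A possible alternative, sketched here on invented toy data rather than the real files, is to align the survival table to the sample barcodes in one step with `DataFrame.reindex`. This assumes each barcode appears at most once in the survival index; the barcodes, sample IDs, and the `OS` / `OS.time` column names are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-ins; barcodes, sample IDs, and values are invented for illustration.
survival_toy = pd.DataFrame(
    {"OS": [1, 0], "OS.time": [120, 900]},
    index=pd.Index(["TCGA-AA-0001", "TCGA-AA-0002"], name="bcr_patient_barcode"),
)
samples_toy = pd.DataFrame(
    {"bcr_patient_barcode": ["TCGA-AA-0002", "TCGA-AA-0003", "TCGA-AA-0001"]},
    index=["S1", "S2", "S3"],
)

# Report the barcodes that have no survival record.
found = samples_toy["bcr_patient_barcode"].isin(survival_toy.index)
missing = samples_toy.loc[~found, "bcr_patient_barcode"]

# reindex looks up each barcode in order and fills NaN where none is found.
# It raises an error if the survival index contains duplicate barcodes.
aligned = survival_toy.reindex(samples_toy["bcr_patient_barcode"])

# Label rows by sample ID, matching what the loop produces.
aligned.index = samples_toy.index

print(aligned)
print("Barcodes without survival data:", list(missing))
```

Samples without a survival record keep a row of missing values, so the output stays aligned one-to-one with the sample metadata.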

## Verify and validate your analysis: 
*Pick a SPECIFIC method to determine how well your model is performing and describe how it works here.*

*(Describe how you checked to see that your analysis gave you an answer that you believe (verify). Describe how you determined if your analysis gave you an answer that is supported by other evidence (e.g., a published paper).)*

## Conclusions and Ethical Implications: 
*(Think about the answer your analysis generated, draw conclusions related to your overarching question, and discuss the ethical implications of your conclusions.)*

## Limitations and Future Work: 
*(Describe the limitations of your analysis and the future work that could address them.)*

## NOTES FROM YOUR TEAM: 
*This is where our team is taking notes and recording activity.*

## QUESTIONS FOR YOUR TA: 
*These are questions we have for our TA.*