## Module_3: *(Template)*

## Team Members:
- Hudson King and Angelina Leung

## Project Title:
VEGF Trends Across Various Cancers



## Project Goal:
- How does the expression of VEGF ligands vary across different cancers?
- Is there a cancer type that requires greater angiogenesis to sustain tumor growth?

## Disease Background:
*Pick a hallmark to focus on, and figure out what genes you are interested in researching based on that decision. Then fill out the information below.*

* Cancer hallmark focus: 
    - Sustained Angiogenesis
* Overview of hallmark: 
    - Our hallmark of interest is sustained angiogenesis, which is the process of growing new blood vessels. In healthy tissue, angiogenesis may occur to heal damage by supplying the tissue with oxygen and nutrients. Similarly, it can nourish cancer cells and promote their growth and spread. Angiogenesis is often seen in the early stages of many cancers. There is usually a balance between angiogenesis promoters, such as vascular endothelial growth factor (VEGF) and fibroblast growth factors (FGF1/2), and inhibitors, such as thrombospondin-1; in cancer, this balance is disrupted. VEGF and FGFs have greater expression while thrombospondin-1 has decreased expression. We will focus on VEGF. These growth factors bind to transmembrane tyrosine kinase receptors on endothelial cells, which are involved in cell proliferation. The binding of VEGF to tyrosine kinase receptors triggers a signal-transduction cascade that activates the ras oncogene and downstream effectors that in turn promote additional growth factors.
* Genes associated with hallmark to be studied (describe the role of each gene, signaling pathway, or gene set you are going to investigate):
    - VEGF drives new blood vessel growth by activating its receptors on endothelial cells, boosting their proliferation, survival, and movement. Its expression rises in response to low oxygen and tumor signals, and in our data it can be tracked via VEGFA and receptor expression plus an angiogenesis gene-set score.
        - https://doi.org/10.1016/S0092-8674(00)81683-9
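
One simple way to turn "an angiogenesis gene-set score" into a number is to standardize each gene across samples and average the z-scores per sample. The sketch below is only an illustration under our own assumptions: the gene list (`ANGIOGENESIS_SET`), the function names, and the toy values are hypothetical, and a curated gene set or an established scoring method could be substituted.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one gene's expression values across samples."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A gene with no variation carries no signal; score it as 0 everywhere.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def gene_set_score(expression, gene_set):
    """Average the per-gene z-scores for each sample.

    expression: {gene_symbol: [value per sample]}, samples in the same order.
    gene_set:   symbols to include; symbols absent from `expression` are skipped.
    """
    present = [g for g in gene_set if g in expression]
    if not present:
        raise ValueError("None of the requested genes are in the expression table.")
    standardized = [zscores(expression[g]) for g in present]
    n_samples = len(standardized[0])
    return [mean(z[i] for z in standardized) for i in range(n_samples)]


# Toy log2(TPM+1) values for three samples (made up for illustration).
toy_expression = {
    "VEGFA": [5.0, 7.0, 9.0],
    "KDR": [2.0, 3.0, 4.0],
    "FLT1": [1.0, 1.0, 1.0],
}
ANGIOGENESIS_SET = ["VEGFA", "KDR", "FLT1"]  # hypothetical gene set, an assumption

scores = gene_set_score(toy_expression, ANGIOGENESIS_SET)
print(scores)
```

Averaging z-scores keeps a highly expressed gene from dominating the score, at the cost of making the score relative to whichever samples are included.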

*Will you be focusing on a single cancer type or looking across cancer types? Depending on your decision, update this section to include relevant information about the disease at the appropriate level of detail. Regardless, each bullet point should be filled in. If you are looking at multiple cancer types, you should investigate differences between the types (e.g. what is the most prevalent cancer type? What type has the highest mortality rate?) and similarities (e.g. what sorts of treatments exist across the board for cancer patients? what is common to all cancers in terms of biological mechanisms?). Note that this is a smaller list than the initial 11 in Module 1.*

* Prevalence & incidence
    - The U.S. death rate for all cancers combined is approximately 145.4 per 100,000 (2019-2023).
    - Lung cancer contributes the most cancer deaths worldwide.
        - https://acsjournals.onlinelibrary.wiley.com/doi/full/10.3322/caac.21834
        - https://www.cancer.org/content/dam/cancer-org/research/cancer-facts-and-statistics/annual-cancer-facts-and-figures/2025/2025-cancer-facts-and-figures-acs.pdf
* Risk factors (genetic, lifestyle) & Societal determinants
    - Tobacco and alcohol use, obesity and physical inactivity, oncogenic infections, environmental exposures, age, and family history.
    - Incidence and stage at diagnosis vary across ethnicities, income levels, geography, sexual orientation, and access to medical treatment.
        - https://www.cdc.gov/cancer/risk-factors/index.html
        - https://www.cancer.gov/about-cancer/understanding/disparities
* Standard of care treatments (& reimbursement)
    - Across cancers, standard of care includes surgery, radiation therapy, chemotherapy, immunotherapy, and endocrine therapy. 
    - Medicare covers various cancer services including inpatient and outpatient radiation services and numerous chemo/targeted delivery drugs.
        - https://www.medicare.gov/coverage/radiation-therapy
* Biological mechanisms (anatomy, organ physiology, cell & molecular physiology)
    - Across cancers, a shared set of hallmarks defines the biological mechanisms of the disease: evading apoptosis, self-sufficiency in growth signals, sustained angiogenesis, limitless replicative potential, tissue invasion and metastasis, and insensitivity to anti-growth signals.


## Data-Set: 

*Once you decide on the subset of data you want to use (i.e. only 1 cancer type or many; any clinical features needed?; which genes will you look at?) describe the dataset. There are a ton of clinical features, so you don't need to describe them all, only the ones pertinent to your question.*

*(Describe the data set(s) you will analyze. Cite the source(s) of the data. Describe how the data was collected -- What techniques were used? What units are the data measured in? Etc.)*

* The data come from a pan-cancer compendium of TCGA RNA-seq profiles processed by Rahman & Piccolo's group. They were generated through specimen collection, RNA sequencing, read alignment, and normalization.

* Origin: TCGA tumor and matched normal tissues for 24 different types of cancer. 
    - ACC: Adrenocortical carcinoma
    - BLCA: Bladder urothelial carcinoma
    - BRCA: Breast invasive carcinoma
    - CESC: Cervical squamous cell carcinoma and endocervical adenocarcinoma
    - COAD: Colon adenocarcinoma
    - DLBC: Lymphoid neoplasm diffuse large B-cell lymphoma
    - GBM: Glioblastoma multiforme
    - HNSC: Head and neck squamous cell carcinoma
    - KICH: Kidney chromophobe
    - KIRC: Kidney renal clear cell carcinoma
    - KIRP: Kidney renal papillary cell carcinoma
    - LAML: Acute myeloid leukemia
    - LGG: Brain lower grade glioma
    - LIHC: Liver hepatocellular carcinoma
    - LUAD: Lung adenocarcinoma
    - LUSC: Lung squamous cell carcinoma
    - OV: Ovarian serous cystadenocarcinoma
    - PRAD: Prostate adenocarcinoma
    - READ: Rectum adenocarcinoma
    - SKCM: Skin cutaneous melanoma
    - STAD: Stomach adenocarcinoma
    - THCA: Thyroid carcinoma
    - UCEC: Uterine corpus endometrial carcinoma
    - UCS: Uterine carcinosarcoma
* Rows: Protein coding genes
* Columns: TCGA sample barcodes
* Values: log2(TPM+1) expression values
- https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE62944
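
Because the columns are TCGA sample barcodes and the values are log2(TPM+1), two small helpers are useful when reading the matrix: one that separates tumor from normal samples using the sample-type code in the barcode (01-09 tumor, 10-19 normal, 20-29 control), and one that converts a value back to TPM. This is a minimal sketch; the function names are our own, and the second barcode below is made up for illustration. The cancer type itself would still have to come from separate sample annotations.

```python
def parse_tcga_barcode(barcode):
    """Split a TCGA sample barcode into patient ID, sample-type code, and tissue class."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"Not a TCGA sample barcode: {barcode!r}")
    # The fourth field starts with a two-digit sample-type code, e.g. "01A" -> 1.
    sample_code = int(parts[3][:2])
    if 1 <= sample_code <= 9:
        tissue = "tumor"
    elif 10 <= sample_code <= 19:
        tissue = "normal"
    else:
        tissue = "control"
    return {
        "patient": "-".join(parts[:3]),
        "sample_code": sample_code,
        "tissue": tissue,
    }


def log2tpm_to_tpm(value):
    """Invert the log2(TPM + 1) transform to recover TPM."""
    return 2 ** value - 1


tumor = parse_tcga_barcode("TCGA-E9-A1NI-01A-11R-A14D-07")
# Hypothetical barcode with a normal-tissue sample code (11), for illustration only.
normal = parse_tcga_barcode("TCGA-XX-0000-11A-11R-0000-07")
print(tumor["patient"], tumor["tissue"])
print(normal["patient"], normal["tissue"])
print(log2tpm_to_tpm(3.0))
```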


## Data Analysis: 

### Methods
The machine learning technique I am using is: *fill in and describe*

*What is this method optimizing? How does the model decide it is "good enough"?*

**

### Analysis
*(Describe how you analyzed the data. This is where you should intersperse your Python code so that anyone reading this can run your code to perform the analysis that you did, generate your figures, etc.)*

In [None]:
import csv

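class Gene:
    """One row of the expression matrix: a gene symbol plus its per-sample values."""
    all_genes = []     # every Gene instance created so far
    sample_ids = []    # sample barcodes read from the CSV header

    def __init__(self, symbol, expression_dict):
        self.symbol = symbol                  # gene symbol, e.g. "A1BG"
        self.expression = expression_dict     # {sample_id: log2(TPM+1) value or None}
        Gene.all_genes.append(self)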

    def __repr__(self):
        return f"{self.symbol} | Samples: {len(self.expression)}"

    def get_symbol(self):
        return self.symbol
    
    def get_expression(self):
        return self.expression

    def get_value(self, sample_id: str):
        return self.expression.get(sample_id, None)

    @classmethod
    def instantiate_from_csv(cls, filename: str):
        """
        Creates one Gene object per row.
        Expects the first column to be gene symbols and the remaining columns to be TCGA samples.
        """
        with open(filename, encoding="utf8", newline="") as f:
            reader = csv.DictReader(f)
            fieldnames = reader.fieldnames
            if not fieldnames:
                raise ValueError("CSV has no header.")

            gene_col = fieldnames[0]
            cls.sample_ids = list(fieldnames[1:])
            cls.all_genes.clear()  # avoid duplicate genes if this cell is re-run

            for row in reader:
                symbol = row[gene_col]

                # Build {sample_id: float_value} dict for this gene;
                # empty or missing cells become None
                expr = {}
                for sid in cls.sample_ids:
                    val = row.get(sid)
                    expr[sid] = float(val) if val not in (None, "") else None

                cls(symbol=symbol, expression_dict=expr)

            # Sort genes alphabetically by symbol
            cls.all_genes.sort(key=lambda gene: gene.symbol)

Gene.instantiate_from_csv("GSE62944_subsample_log2TPM.csv")

In [7]:
print(len(Gene.all_genes), "Genes")
print("First 3 samples:", Gene.sample_ids[:3])
print(Gene.all_genes[0])                  
s0 = Gene.sample_ids[0]
print(Gene.all_genes[0].get_value(s0))   # log2TPM for first gene in first sample

15716 Genes
First 3 samples: ['TCGA-E9-A1NI-01A-11R-A14D-07', 'TCGA-E2-A1LK-01A-21R-A14D-07', 'TCGA-BH-A0B2-01A-11R-A10J-07']
A1BG | Samples: 1802
3.397369106338275
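
With the genes loaded, a natural next step toward our project question is to group one gene's values by cancer type and compare the medians. The sketch below assumes a hypothetical `sample_to_group` mapping from sample ID to cancer type, which is not part of the CSV loaded above and would have to come from separate sample annotations. The sample IDs and values are toy data. With the `Gene` class, `expression` would be the dictionary returned by `get_expression()` for the VEGFA gene.

```python
from collections import defaultdict
from statistics import median


def median_by_group(expression, sample_to_group):
    """Median expression per group, skipping missing values and unlabeled samples.

    expression:      {sample_id: log2(TPM+1) value or None}
    sample_to_group: {sample_id: cancer type label}  (hypothetical annotation)
    """
    grouped = defaultdict(list)
    for sample_id, value in expression.items():
        group = sample_to_group.get(sample_id)
        if value is None or group is None:
            continue
        grouped[group].append(value)
    return {group: median(values) for group, values in grouped.items()}


def rank_groups(medians):
    """Sort groups from highest to lowest median expression."""
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)


# Toy data (made up): five samples, two cancer types, one missing value.
toy_vegfa = {"S1": 6.0, "S2": 8.0, "S3": 4.0, "S4": 5.0, "S5": None}
toy_labels = {"S1": "KIRC", "S2": "KIRC", "S3": "BRCA", "S4": "BRCA", "S5": "BRCA"}

medians = median_by_group(toy_vegfa, toy_labels)
ranking = rank_groups(medians)
for cancer_type, value in ranking:
    print(cancer_type, value)
```

The median is used here rather than the mean because expression distributions within a cancer type can be skewed by a few extreme samples.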


## Verify and validate your analysis: 
*Pick a SPECIFIC method to determine how well your model is performing and describe how it works here.*

*(Describe how you checked to see that your analysis gave you an answer that you believe (verify). Describe how you determined if your analysis gave you an answer that is supported by other evidence (e.g., a published paper).)*

## Conclusions and Ethical Implications: 
*(Think about the answer your analysis generated, draw conclusions related to your overarching question, and discuss the ethical implications of your conclusions.)*

## Limitations and Future Work: 
*(Think about the answer your analysis generated, draw conclusions related to your overarching question, and discuss the ethical implications of your conclusions.)*

## NOTES FROM YOUR TEAM: 
- Working on fine-tuning question and exploring different data sets. 

## QUESTIONS FOR YOUR TA: 
*These are questions we have for our TA.*

In [None]:
N/A