## 📘 Notebook 00: Biomedical background

---
### **Summary**

This notebook provides a comprehensive introduction to breast cancer classification, with a particular focus on the molecular subtyping of tumors based on gene expression data. 

Molecular subtyping has become an essential tool in precision oncology, enabling more personalized treatment strategies and improving patient outcomes. While traditional methods such as histopathological evaluation and immunohistochemistry (IHC) remain standard in clinical practice, gene expression profiling allows for a more detailed understanding of tumor biology and heterogeneity.

This project aims to build an explainable machine learning model that classifies breast cancer into molecular subtypes using RNA-seq data. By leveraging publicly available datasets (e.g., TCGA-BRCA), the model will not only generate accurate predictions but also provide interpretable insights into the biological basis of classification decisions. 

This notebook lays the necessary biomedical and methodological groundwork for understanding the medical relevance, limitations, and scientific motivation behind the project.

---

### 1. **Introduction**

Breast cancer is the most commonly diagnosed malignancy among women worldwide, with approximately 2.3 million new cases and 670,000 deaths reported in 2022 alone[[1,2,3]](#references). Despite advances in detection and treatment, several challenges persist, including its high incidence, substantial mortality rate, and the difficulty of diagnosing tumors accurately enough to select the most effective therapy.

Importantly, *breast cancer is not a single disease*. Instead, it encompasses a biologically and clinically heterogeneous group of tumors. To guide appropriate treatment strategies, classification into specific subtypes is essential.

There are three main categories commonly used to classify breast cancer:

- **Histopathological subtypes:**  
  Determined by examining tumor tissue under a microscope. This approach distinguishes between different morphological patterns such as IDC and ILC. It remains the most widely used and clinically established classification and is routinely performed as part of diagnostic procedures.

- **Tumor grade and stage:**  
  These parameters, also assessed histologically, reflect the level of differentiation (grade) and the extent of tumor spread (stage).

- **Molecular subtypes:**  
  Based on biological markers and gene expression profiles, these subtypes offer a deeper understanding of tumor biology and behavior.

A major problem in current clinical practice is that suboptimal classification may lead to inappropriate treatments and, consequently, therapy resistance or unnecessary side effects. A more refined classification system, such as molecular subtyping, has the potential to significantly improve diagnostic accuracy and inform personalized treatment decisions.

Research has shown that more precise subtyping contributes to improved OS and better outcomes for patients[[4,5]](#references).

---

### 2. **Molecular Subtypes**

Recent advances in molecular biology have enabled the classification of breast cancer into distinct subtypes based on the expression of specific biomarkers. These subtypes, defined by characteristic molecular signatures, are critical for prognosis and treatment selection.

The four major molecular subtypes of breast cancer are:

- **Luminal A**  
- **Luminal B**  
- **HER2-positive**  
- **Basal-like / TNBC**  

These classifications are primarily determined through the presence or absence of key receptors: ER, PR, and HER2[[6]](#references).

| Subtype             | Prevalence | Aggressiveness | Receptor Status        |
|---------------------|------------|----------------|------------------------|
| Luminal A           | 50–60%     | Low            | ER+, PR+, HER2−        |
| Luminal B           | 10–20%     | Medium         | ER+, PR±, HER2±        |
| HER2-positive       | ~15%       | High           | ER−, PR−, HER2+        |
| Basal-like / TNBC   | 10–20%     | Very High      | ER−, PR−, HER2−        |
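
The receptor patterns in the table can be read as a simple lookup. The following is a minimal sketch for illustration, not a clinical rule: the function name `surrogate_subtype` and the optional `high_proliferation` flag are assumptions introduced here. Because the table lists Luminal B as PR± and HER2±, an ER+, PR+, HER2− tumor matches both luminal rows, so the sketch reports that case as ambiguous unless a proliferation flag is supplied.

```python
from typing import Optional


def surrogate_subtype(
    er: bool,
    pr: bool,
    her2: bool,
    high_proliferation: Optional[bool] = None,  # hypothetical extra input, not in the table
) -> str:
    """Map receptor status to a subtype label using only the table above."""
    if not er and not pr:
        # ER-, PR-: the HER2 status separates the two remaining rows.
        return "HER2-positive" if her2 else "Basal-like / TNBC"
    if not er:
        # ER-, PR+ does not appear in the table.
        return "Not covered by the table"
    if her2 or not pr:
        # ER+ with HER2+ or PR- only fits the Luminal B row.
        return "Luminal B"
    # ER+, PR+, HER2-: fits both luminal rows.
    if high_proliferation is None:
        return "Luminal A or Luminal B"
    return "Luminal B" if high_proliferation else "Luminal A"


print(surrogate_subtype(er=False, pr=False, her2=False))
print(surrogate_subtype(er=True, pr=True, her2=False, high_proliferation=False))
```

The ambiguous return value is deliberate: it makes visible that receptor status alone does not always separate Luminal A from Luminal B.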

In addition to receptor status, each molecular subtype exhibits a distinct gene expression profile:

- **Luminal A**
  - ↑ Luminal epithelial genes (e.g., *KRT8*, *KRT18*)
  - ↑ ER-related genes (e.g., *ESR1*, *GATA3*)
  - ↓ Proliferation-related genes

- **Luminal B**
  - ↑ Proliferation-related genes (e.g., *MKI67*)
  - ↑ HER2-related genes (in some cases)
  - ↓ Luminal epithelial gene expression compared to Luminal A

- **HER2-positive**
  - ↑ HER2 signaling genes (e.g., *ERBB2*, *GRB7*)
  - ↓ ER-related and luminal genes

- **Basal-like / TNBC**
  - ↑ Basal cytokeratins (e.g., *KRT5*, *KRT14*, *KRT17*)
  - ↑ Proliferation-related genes
  - ↓ Expression of ER, PR, HER2, and luminal markers

These gene signatures not only define the subtype but also correlate with therapeutic response and overall prognosis, emphasizing the value of molecular subtyping in personalized oncology. They are also the signals that gene expression profiling measures when assigning a subtype.
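
One way to make these signatures concrete is to average the standardized expression of each marker group for a single sample. The sketch below assumes the input is a mapping from gene symbol to a z-score that was already computed across a cohort; the function name `marker_set_scores`, the group names, and the example values are invented for illustration. The gene groups themselves are the ones named in the list above.

```python
from statistics import mean

# Marker groups taken from the signature list above.
MARKER_SETS = {
    "luminal_epithelial": ["KRT8", "KRT18"],
    "er_related": ["ESR1", "GATA3"],
    "proliferation": ["MKI67"],
    "her2_signaling": ["ERBB2", "GRB7"],
    "basal_cytokeratins": ["KRT5", "KRT14", "KRT17"],
}


def marker_set_scores(z_scores):
    """Return the mean z-score per marker group, skipping genes that are missing."""
    scores = {}
    for name, genes in MARKER_SETS.items():
        present = [z_scores[g] for g in genes if g in z_scores]
        if present:
            scores[name] = mean(present)
    return scores


# Invented z-scores for one sample (not real data).
example_sample = {
    "KRT8": -1.2, "KRT18": -0.8,
    "ESR1": -1.5, "GATA3": -1.1,
    "MKI67": 1.4,
    "ERBB2": -0.6, "GRB7": -0.4,
    "KRT5": 1.8, "KRT14": 1.5, "KRT17": 1.2,
}

scores = marker_set_scores(example_sample)
for name, value in scores.items():
    print(f"{name}: {value:+.2f}")
```

In this made-up sample the basal cytokeratin and proliferation groups score high while the luminal, ER-related, and HER2 groups score low, which is the pattern described for Basal-like / TNBC.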


---

### 3. **IHC vs. Gene Expression Profiling**

The most commonly used and clinically standardized method for molecular subtyping of breast cancer is IHC[[7]](#references).  
IHC relies on antibody-based staining to detect the presence of key receptors (ER, PR, and HER2) in tumor tissue. It is widely favored in clinical settings due to its **speed**, **affordability**, and **integration into routine diagnostic workflows**.

However, gene expression profiling offers a more nuanced alternative. It assesses the activity of thousands of genes simultaneously using techniques such as RNA-seq. This method provides a comprehensive view of the tumor's molecular landscape, including pathway activation and proliferation status.

While gene expression profiling is more expensive and technically demanding, requiring RNA extraction, specialized equipment, and bioinformatics expertise, it delivers **higher precision** and **greater biological insight** than IHC alone. This opens the door to more precise and therefore more effective treatments.

In research settings or cases where RNA-seq data is already available, gene expression profiling becomes a powerful and cost-effective tool for classification, hypothesis generation, and educational exploration.


---

### 4. **Computational Tools for Gene Expression-Based Subtyping**

Several gene expression-based classifiers have been developed to predict molecular breast cancer subtypes, the most widely known being **PAM50**[[8]](#references). This assay uses the expression levels of 50 specific genes to categorize tumors into intrinsic subtypes: *Luminal A*, *Luminal B*, *HER2-enriched*, *Basal-like*, and *Normal-like*. PAM50 has been validated in multiple studies[[8,9]](#references) and is commonly used in both research and clinical contexts due to its strong prognostic value.
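
To give a feel for how a signature-based classifier of this kind can assign a label, the sketch below implements a nearest-centroid rule: a sample is compared with one reference profile per subtype and receives the label of the most similar one. It assumes Spearman rank correlation as the similarity measure. The seven genes, the centroid values, and the sample are invented for illustration; they are not the PAM50 gene list or its centroids, and the Normal-like class is left out for brevity.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


GENES = ["ESR1", "GATA3", "MKI67", "ERBB2", "GRB7", "KRT5", "KRT17"]

# Hypothetical centroids (made-up values, same gene order as GENES).
TOY_CENTROIDS = {
    "Luminal A":     [1.5, 1.2, -1.0, -0.2, -0.3, -0.8, -0.9],
    "Luminal B":     [1.0, 0.8, 0.9, 0.1, 0.0, -0.9, -1.0],
    "HER2-enriched": [-0.8, -0.5, 0.6, 2.0, 1.8, -0.3, -0.2],
    "Basal-like":    [-1.5, -1.3, 1.2, -0.5, -0.4, 1.6, 1.4],
}


def classify(sample):
    """Return the best-matching label and all per-subtype correlations."""
    correlations = {
        label: spearman(sample, centroid)
        for label, centroid in TOY_CENTROIDS.items()
    }
    best = max(correlations, key=correlations.get)
    return best, correlations


toy_sample = [-1.2, -1.0, 1.0, -0.6, -0.3, 1.9, 1.1]
label, correlations = classify(toy_sample)
print(label)
```

The per-subtype correlations returned alongside the label show how close the runner-up was, but they do not say which genes drove the decision.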

However, existing classifiers like PAM50 have limitations. They typically function as fixed, non-transparent models and do not provide insight into why a specific classification is made for a given sample. Furthermore, their commercial implementations can be costly, and they often require centralized lab processing.

In this project, we aim to build an **explainable machine learning model** that predicts breast cancer molecular subtypes based on RNA-seq gene expression data. Unlike traditional black-box classifiers, this model will incorporate **SHAP** values to provide biologically interpretable insights into its predictions. 

The model is designed to be:

- **Explainable**: Key features contributing to each classification will be transparent and traceable.
- **Lightweight**: It can be trained and executed on a standard laptop without requiring high-performance computing resources.
- **Open source and free**: Allowing broad accessibility for researchers and students.
- **Educational**: Serving as a tool not only for prediction but also for learning about gene expression, cancer biology, and interpretable AI.

By combining machine learning with explainability, this tool aims to offer a more transparent and accessible alternative to commercial gene expression assays, especially in research and academic settings.
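
The idea behind SHAP values can be illustrated without any library by computing exact Shapley values for a tiny model. The sketch below is an assumption-laden toy, not the API of the `shap` package and not the model built in this project: features outside a coalition are replaced by a single baseline value, and the three-gene linear model with its weights is invented. For each feature it averages the change in prediction when that feature is added to every possible subset of the other features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley values by enumerating all feature subsets (small n only)."""
    n = len(sample)

    def value(subset):
        # Features in the subset take the sample's value, the rest the baseline's.
        mixed = [sample[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


# Hypothetical linear "luminal score" over three genes (made-up weights).
FEATURES = ["ESR1", "GATA3", "KRT5"]
WEIGHTS = [2.0, 1.0, -1.5]


def toy_model(x):
    return sum(w * v for w, v in zip(WEIGHTS, x))


toy_sample = [1.5, 1.0, -0.5]   # invented z-scores for one sample
toy_baseline = [0.0, 0.0, 0.0]  # cohort mean in z-score space

phi = shapley_values(toy_model, toy_sample, toy_baseline)
for gene, contribution in zip(FEATURES, phi):
    print(f"{gene}: {contribution:+.2f}")
```

The contributions add up to the difference between the prediction for the sample and the prediction for the baseline, which is the additivity property that makes per-gene attributions straightforward to read. Exhaustive enumeration grows exponentially with the number of features, so it is only feasible for toy inputs like this one.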

---

### 5. **Conclusion**

While IHC remains the clinical standard for molecular subtyping due to its practicality and cost-efficiency, gene expression profiling offers a more granular view of tumor biology. However, the use of gene expression-based assays in routine practice is still limited by technical complexity and cost.

This project addresses these limitations by aiming to develop an **interpretable and accessible machine learning model** that leverages RNA-seq data to classify breast cancer into molecular subtypes. Unlike black-box models, the proposed approach integrates explainability to ensure biological transparency, making it not only a predictive tool but also a valuable platform for research, education, and hypothesis generation.

Ultimately, the long-term value of this project lies in its potential to bridge the gap between computational modeling and clinical decision-making, helping to democratize access to precision oncology tools in both research and low-resource clinical settings.

---

**NOTE**: This is just a very brief and shortened summary of breast cancer and its molecular subtypes, to give necessary context for the following project.

For a deeper look at how molecular subtyping translates into different forms of treatment and improves medical outcomes, I recommend reading:

[**Molecular Subtypes and Mechanisms of Breast Cancer: Precision Medicine Approaches for Targeted Therapies**](https://www.mdpi.com/2072-6694/17/7/1102)

---

### Abbreviations

- **ER**: Estrogen Receptor  
- **HER2**: Human Epidermal Growth Factor Receptor 2  
- **IDC**: Invasive Ductal Carcinoma  
- **IHC**: Immunohistochemistry
- **ILC**: Invasive Lobular Carcinoma  
- **OS**: Overall Survival  
- **PR**: Progesterone Receptor  
- **RNA-seq**: RNA Sequencing
- **SHAP**: SHapley Additive exPlanations
- **TNBC**: Triple-Negative Breast Cancer  

---

### References

- [1] Europa Donna – The European Breast Cancer Coalition. (n.d.). *Breast cancer* [https://www.europadonna.org/breast-cancer/](https://www.europadonna.org/breast-cancer/)
- [2] World Cancer Research Fund. (n.d.). *Breast cancer statistics* [https://www.wcrf.org/preventing-cancer/cancer-statistics/breast-cancer-statistics/](https://www.wcrf.org/preventing-cancer/cancer-statistics/breast-cancer-statistics/)
- [3] World Health Organization (WHO). (n.d.). *Breast cancer: Fact sheet* [https://www.who.int/news-room/fact-sheets/detail/breast-cancer](https://www.who.int/news-room/fact-sheets/detail/breast-cancer)
- [4] Verma, S., Miles, D., Gianni, L., et al. (2012). *Trastuzumab emtansine for HER2-positive advanced breast cancer*. New England Journal of Medicine, 367(19), 1783–1791. [https://pubmed.ncbi.nlm.nih.gov/23020162/](https://pubmed.ncbi.nlm.nih.gov/23020162/)
- [5] Jhaveri, K., et al. (2024). *Inavolisib plus palbociclib–fulvestrant in PIK3CA-mutated HR+/HER2− advanced breast cancer (INAVO120 trial)*. New England Journal of Medicine. [https://pubmed.ncbi.nlm.nih.gov/40454641/](https://pubmed.ncbi.nlm.nih.gov/40454641/)
- [6] Prat, A., & Perou, C. M. (2015). *The molecular subtypes of breast cancer: prognostic and therapeutic implications*. Clinical Cancer Research. [https://pubmed.ncbi.nlm.nih.gov/26253814/](https://pubmed.ncbi.nlm.nih.gov/26253814/)
- [7] Arpita, J. (2022). *A Study of Molecular Subtypes of Carcinoma Breast by Immunohistochemical Surrogates for Molecular Classification*. Asian Pacific Journal of Cancer Biology. [https://waocp.com/journal/index.php/apjcb/article/view/661](https://waocp.com/journal/index.php/apjcb/article/view/661)
- [8] Nielsen, T., Wallden, B., Schaper, C., et al. (2015). *Development and verification of the PAM50-based Prosigna breast cancer gene signature assay*. BMC Medical Genomics. [https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-015-0129-6](https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-015-0129-6)
- [9] Liu, M. C., et al. (2016). *PAM50 gene signatures and breast cancer prognosis with contemporary treatment*. npj Breast Cancer. [https://pubmed.ncbi.nlm.nih.gov/27857199/](https://pubmed.ncbi.nlm.nih.gov/27857199/)
