## 📘 Notebook 01: Project Overview

---
### **Abstract**

Molecular subtyping of breast cancer plays a vital role in prognosis and treatment decisions. While established methods like PAM50 and commercial assays such as Oncotype DX offer valuable insights, they are often rigid, closed-source, and costly. This project presents an open, explainable machine learning pipeline trained on RNA-Seq gene expression data from TCGA-BRCA to classify breast cancer into molecular subtypes (e.g., Luminal A/B, HER2-enriched, Basal-like).

Unlike traditional models, this approach emphasizes interpretability through SHAP, enabling transparent, patient-specific explanations of predictions. The goal is not to replace existing tools, but to demonstrate that accurate and reproducible subtype classification can be achieved using open data and transparent algorithms, making molecular subtyping more accessible, flexible, and understandable. By combining explainable AI and high-dimensional omics data, this project offers a research-grade framework for advancing precision oncology, especially in educational and resource-limited settings.

---

### 1. **Project Objective**

The goal of this project is to develop an explainable machine learning model for the classification of molecular subtypes of breast cancer based on gene expression data (e.g., TCGA-BRCA).  

The model is intended not only to provide accurate predictions, but also to explain its decisions using interpretability techniques such as SHAP.  
The focus lies on biological relevance, interpretability, and scientific reproducibility.

---

### 2. **Motivation**

Through this project, I aim to learn how to work with real-world, complex biological and medical datasets (e.g., TCGA): how to understand, analyze, and systematically process them.  
Additionally, I want to design, train, evaluate, and interpret an explainable ML model in Python.

This project is meant to give me a first insight into bioinformatics research, not just technically, but also in terms of scientific thinking.  
It marks my entry into a field I aspire to contribute to in the long term.  
I want to demonstrate not only my interest, but also my dedication, passion, and drive to make an impact in bioinformatics.

---

### 3. **Tools & Technologies**

- *Python* as the main programming language (in Jupyter Notebooks)  

- *pandas*, *NumPy* for data handling  

- *matplotlib*, *seaborn* for visualization  

- *scikit-learn*, *XGBoost* for modeling  

- *SHAP* for explainable model interpretation  

- *Virtual environments* for clean package management  

- *GitHub* for reproducibility & version control  

- *TCGA-BRCA* gene expression data via *UCSC Xena*
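
As a rough illustration of how the tools above fit together for data handling, the sketch below reads a tab-separated expression matrix with pandas. It assumes the usual layout of Xena matrix downloads (one row per gene, one column per sample) and transposes it to the samples-by-genes shape that scikit-learn expects. The function name `load_xena_matrix`, the sample IDs, and the values are hypothetical placeholders, not real TCGA data.

```python
import io

import pandas as pd


def load_xena_matrix(source) -> pd.DataFrame:
    """Read a genes-by-samples TSV and return a samples-by-genes DataFrame.

    `source` can be a file path or a file-like object. The genes-by-samples
    orientation is an assumption about the downloaded file.
    """
    genes_by_samples = pd.read_csv(source, sep="\t", index_col=0)
    samples_by_genes = genes_by_samples.T
    samples_by_genes.index.name = "sample"
    samples_by_genes.columns.name = "gene"
    return samples_by_genes


# Tiny made-up matrix standing in for a real download (hypothetical IDs and values).
demo_tsv = (
    "sample\tTCGA-XX-0001-01\tTCGA-XX-0002-01\n"
    "ESR1\t12.1\t3.4\n"
    "ERBB2\t8.0\t13.2\n"
    "KRT5\t2.5\t9.9\n"
)

expression = load_xena_matrix(io.StringIO(demo_tsv))
print(expression)
```

Keeping samples as rows from the start means the same DataFrame can be passed to both the EDA plots and the models without further reshaping.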


---

### 4. **Methodological Approach**

- **Exploratory Data Analysis (EDA)** to understand structure and identify patterns  

- **Feature Engineering**: filtering, normalization, and potentially feature selection  

- **Machine Learning Classification** using XGBoost and comparison models (e.g., Random Forest)  

- **Model Evaluation** with metrics such as accuracy, multiclass ROC-AUC (e.g., one-vs-rest), and a confusion matrix  

- **Interpretability** via SHAP values to ensure biological transparency  

- **Validation** through k-fold cross-validation
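
The steps above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the project's final pipeline: `make_classification` stands in for the TCGA-BRCA matrix, a variance filter stands in for feature engineering, and logistic regression is used as a second comparison model. XGBoost's `XGBClassifier` follows the scikit-learn estimator interface and could be added to the `models` dictionary; it is left out here so the sketch depends only on scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 "samples", 50 "genes", 4 "subtypes" (all made up).
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=10,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Stratified k-fold keeps the subtype proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = {"accuracy": "accuracy", "roc_auc_ovr": "roc_auc_ovr"}

# Filtering and scaling live inside each pipeline, so they are re-fit on the
# training part of every fold and never see the held-out samples.
models = {
    "random_forest": make_pipeline(
        VarianceThreshold(threshold=0.0),
        RandomForestClassifier(n_estimators=100, random_state=0),
    ),
    "logistic_regression": make_pipeline(
        VarianceThreshold(threshold=0.0),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    ),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {
        metric: float(np.mean(scores[f"test_{metric}"])) for metric in scoring
    }
    print(name, results[name])

# Confusion matrix built from out-of-fold predictions of one model.
y_pred = cross_val_predict(models["random_forest"], X, y, cv=cv)
cm = confusion_matrix(y, y_pred)
print(cm)
```

Wrapping preprocessing and model in one pipeline is a deliberate choice: selecting or scaling features on the full dataset before cross-validation would leak information from the test folds and inflate the reported metrics.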

---

### 5. **Project Phases**

- **Phase 0: Background Research**  
  Literature review on breast cancer subtypes, gene expression, and explainable AI (XAI); familiarization with TCGA and RNA-Seq data formats.

- **Phase 1: Project Definition**  
  Formalize goals, establish project scope, and design notebook and file structure.

- **Phase 2: Data Acquisition & Preprocessing**  
  Download and preprocess TCGA-BRCA RNA-Seq data for analysis.

- **Phase 3: Model Development & Evaluation**  
  Implement classification models and evaluate their performance.

- **Phase 4: Model Interpretation**  
  Analyze model outputs using SHAP to assess biological relevance.

- **Phase 5: Documentation & Results Preparation**  
  Summarize findings, structure results, and prepare visualizations and narratives for presentation.

---

### **Next Steps**

- Begin data acquisition and initial preprocessing (Phase 2)
- Explore and document TCGA-BRCA dataset structure
- Design and implement exploratory data analysis notebook