# Breast Cancer Survival Analysis: Study Protocol

## 1. Study Objective

**Primary Objective**: 
Develop and compare prognostic models for breast cancer survival using 
clinical variables and gene expression data.

**Secondary Objectives**:
- Identify gene expression signatures associated with survival outcomes
- Compare predictive performance of traditional statistical vs. regularized 
  machine learning approaches
- Demonstrate methods applicable to precision oncology real-world evidence

---

## 2. Data Source

**Dataset**: METABRIC Breast Cancer Gene Expression and Clinical Data
**Source**: Kaggle (https://www.kaggle.com/datasets/...)

**Expected Sample**:
- Approximately 1,500-2,000 patients
- Primary invasive breast cancer, predominantly early-stage
- Gene expression profiles + clinical outcomes

---

## 3. Study Population

**Inclusion Criteria** (pre-specified):
- Patients with complete survival data (time and event status)
- Survival time > 0 months
- At least one clinical covariate available

**Exclusion Criteria**:
- Missing survival time or event indicator
- Patients with survival time = 0 (data quality issue)
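
The criteria above could be applied along these lines. This is a minimal sketch with pandas; the column names (`os_months`, `event`, `age`, `stage`, `er_status`) are hypothetical placeholders and would need to be mapped to the actual dataset columns.

```python
import pandas as pd

# Hypothetical column names; the real dataset may label these differently.
TIME_COL = "os_months"
EVENT_COL = "event"  # 1 = death, 0 = censored
CLINICAL_COLS = ["age", "stage", "er_status"]


def apply_criteria(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that satisfy the pre-specified inclusion criteria."""
    has_outcome = df[TIME_COL].notna() & df[EVENT_COL].notna()
    positive_time = df[TIME_COL] > 0  # NaN compares as False
    has_covariate = df[CLINICAL_COLS].notna().any(axis=1)
    return df[has_outcome & positive_time & has_covariate].copy()


# Toy data: only the first row meets every criterion.
toy = pd.DataFrame({
    "os_months": [12.0, 0.0, None, 30.5, 8.0],
    "event": [1, 0, 1, None, 0],
    "age": [55, 60, 47, 70, None],
    "stage": ["II", "I", "III", "II", None],
    "er_status": ["positive", "negative", "positive", None, None],
})
cohort = apply_criteria(toy)
print(f"Retained {len(cohort)} of {len(toy)} patients")
```

Logging the number of patients removed by each criterion separately would make it easier to report a cohort flow diagram later.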

**Handling Missing Data**:
- Clinical variables: Complete case analysis for Cox models
- Gene expression: Per-gene median imputation (justified by the observed missingness rate; see Section 6)
- Sensitivity analysis: Compare complete cases vs. imputed
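
One way the median imputation might look in practice is sketched below. It assumes, beyond what the protocol states, that medians are learned on the training portion only and then reused for the test portion, so that no information from held-out patients leaks into the imputed values. The function name and toy gene values are illustrative.

```python
import pandas as pd


def impute_with_train_medians(train: pd.DataFrame, test: pd.DataFrame):
    """Fill missing expression values with per-gene medians from the training set."""
    medians = train.median()  # missing values are skipped by default
    return train.fillna(medians), test.fillna(medians)


# Toy expression matrices (rows = patients, columns = genes).
train_genes = pd.DataFrame({"ESR1": [1.0, None, 3.0], "ERBB2": [0.0, 2.0, None]})
test_genes = pd.DataFrame({"ESR1": [None, 5.0], "ERBB2": [4.0, None]})

train_imp, test_imp = impute_with_train_medians(train_genes, test_genes)
print(test_imp)
```

For the sensitivity analysis, the same models can be refit on rows with no missing expression values (`train_genes.dropna()`) and the results compared.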

---

## 4. Study Variables

### **Outcome Variable**
- **Primary**: Overall survival (OS)
  - Time: Months from diagnosis to death or last follow-up
  - Event: Death (1) vs. censored (0)

### **Clinical Covariates** (pre-specified)
- Age at diagnosis (continuous)
- Tumor stage (categorical: I, II, III, IV or T1-T4)
- ER status (positive/negative)
- PR status (positive/negative)  
- HER2 status (positive/negative)
- Tumor grade (1/2/3 or low/intermediate/high)

### **Genomic Features**
- Gene expression values (continuous, log-transformed if needed)
- Specific genes of interest (ESR1, ERBB2, TP53, PIK3CA - if available)
- Genome-wide expression for feature selection

---

## 5. Statistical Analysis Plan

### **5.1 Exploratory Data Analysis**
- Descriptive statistics (median, IQR for continuous; frequencies for categorical)
- Distribution of survival times
- Event rate (% deaths vs. censored)
- Median follow-up time (e.g., reverse Kaplan-Meier estimate)
- Missing data patterns

### **5.2 Kaplan-Meier Analysis**
- Overall survival curve with 95% CI
- Median survival time
- 5-year survival probability

**Stratified Analysis** (pre-specified):
- By ER status
- By tumor stage
- By HER2 status (if available)

**Statistical Test**: Log-rank test for equality of survival curves (α = 0.05)
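
To make the estimator concrete, here is a dependency-free sketch of the Kaplan-Meier product-limit calculation and the median survival time derived from it. The notebooks would more likely delegate this to lifelines (`KaplanMeierFitter`, `logrank_test`); the function names and toy data below are illustrative only.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death."""
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same time point.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed  # deaths and censored patients leave the risk set
    return curve


def median_survival(curve):
    """First time at which S(t) drops to 0.50 or below, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Toy data: months of follow-up and event indicator (1 = death, 0 = censored).
toy_times = [6, 12, 12, 20, 30, 45]
toy_events = [1, 1, 0, 1, 0, 1]
km_curve = kaplan_meier(toy_times, toy_events)
print(km_curve, median_survival(km_curve))
```

If the curve never falls to 0.50, the median is not estimable and should be reported as "not reached" rather than imputed.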

### **5.3 Cox Proportional Hazards Models**

**Model 1: Clinical Variables Only**
- Covariates: Age, stage, ER status, PR status, HER2 status, grade
- Check proportional hazards assumption (Schoenfeld residual test; p > 0.05 taken as no evidence of violation)
- Report: Hazard ratios (HR), 95% CI, p-values
- Performance: Concordance index (C-index)

**Model 2: Gene Expression (LASSO Regularization)**
- Method: Coxnet (penalized Cox regression) with LASSO penalty (`l1_ratio` = 1.0)
- Cross-validation: 5-fold CV within the training set to select the penalty strength (alpha)
- Feature selection: Genes with non-zero coefficients
- Performance: C-index
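
For reference, the objective that an elastic-net penalized Cox model of this kind maximizes can be written as follows. This is a sketch in generic notation; the symbols $\delta_i$, $R(t_i)$, and $r$ are introduced here rather than taken from the protocol, the partial likelihood is shown in its form without tied event times, and implementations differ in how they scale the likelihood term.

$$
\hat{\beta} = \arg\max_{\beta} \left[ \sum_{i:\,\delta_i = 1} \left( x_i^\top \beta - \log \sum_{j \in R(t_i)} \exp\left(x_j^\top \beta\right) \right) - \alpha \left( r \lVert \beta \rVert_1 + \frac{1 - r}{2} \lVert \beta \rVert_2^2 \right) \right]
$$

Here $\delta_i$ is the event indicator, $R(t_i)$ is the set of patients still at risk at time $t_i$, $\alpha$ is the penalty strength chosen by cross-validation, and $r$ is the L1 ratio. With $r = 1$ the penalty is pure LASSO, which sets some coefficients exactly to zero; the genes with $\hat{\beta}_j \neq 0$ are the selected features.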

**Model 3: Combined Clinical + Selected Genes**
- Include clinical variables + genes selected by LASSO
- Performance: C-index
- Comparison: Bootstrap 95% CI for C-index difference
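
The bootstrap comparison could be sketched as below, using only the standard library. It assumes a simplified Harrell's C (pairs with tied times are ignored) and a percentile interval; `c_index` and `bootstrap_delta_c` are illustrative names, and the risk scores would come from the fitted models evaluated on the same patients.

```python
import random


def c_index(times, events, risk):
    """Harrell's C: share of comparable pairs in which the earlier death has higher risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is a death
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def bootstrap_delta_c(times, events, risk_a, risk_b, n_boot=1000, seed=42):
    """Percentile 95% CI for C(model B) - C(model A), resampling patients."""
    rng = random.Random(seed)
    n = len(times)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [times[k] for k in idx]
        e = [events[k] for k in idx]
        c_a = c_index(t, e, [risk_a[k] for k in idx])
        c_b = c_index(t, e, [risk_b[k] for k in idx])
        if c_a == c_a and c_b == c_b:  # skip resamples with no comparable pairs
            deltas.append(c_b - c_a)
    deltas.sort()
    lower = deltas[int(0.025 * len(deltas))]
    upper = deltas[int(0.975 * len(deltas)) - 1]
    return lower, upper


# Toy data: model B ranks patients perfectly, model A is uninformative.
toy_times = [5, 10, 15, 20, 25, 30, 35, 40]
toy_events = [1, 1, 0, 1, 1, 0, 1, 1]
risk_clinical = [1.0] * 8
risk_combined = [8, 7, 6, 5, 4, 3, 2, 1]
ci = bootstrap_delta_c(toy_times, toy_events, risk_clinical, risk_combined, n_boot=200)
print(ci)
```

Resampling patients (rather than the two C-index values independently) keeps the comparison paired, which is what makes the interval for the difference meaningful.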

### **5.4 Model Validation**
- **Internal validation**: Train/test split (70/30)
  - Random seed: 42 (for reproducibility)
  - Stratify by event status
- **Calibration**: Compare predicted vs. observed survival (if sample size permits)
- **Discrimination**: C-index on test set
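
A stratified 70/30 split with a fixed seed can be sketched without extra dependencies as follows. In the notebooks, scikit-learn's `train_test_split` with its `stratify` argument would be a common alternative; the helper name here is illustrative.

```python
import random


def stratified_split(events, test_frac=0.30, seed=42):
    """Return (train_idx, test_idx), preserving the event rate in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(events)):
        idx = [i for i, e in enumerate(events) if e == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy cohort: 40 deaths and 60 censored patients.
toy_events = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(toy_events)
test_event_rate = sum(toy_events[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_event_rate)
```

All model fitting, penalty tuning, and gene selection would then use `train_idx` only, with `test_idx` reserved for the final C-index and calibration checks.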

### **5.5 Sensitivity Analyses** (if time permits)
- Complete case vs. median imputation for gene expression
- Alternative regularization: Elastic Net (`l1_ratio` = 0.5)
- Time-varying effects for variables violating PH assumption

---

## 6. Statistical Considerations

**Sample Size**: 
- No formal power calculation (secondary data analysis)
- Rule of thumb: ≥10 events per predictor variable
- For high-dimensional genomic data, use regularization to prevent overfitting
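
As a rough worked example of the events-per-variable rule applied to the clinical model: the parameter counts below assume dummy coding with one reference level per categorical variable, and the level counts are assumptions based on the covariate list rather than confirmed from the data.

```python
# Model parameters per clinical covariate under dummy coding (assumed level counts).
CLINICAL_PARAMS = {
    "age": 1,          # continuous
    "stage": 3,        # 4 levels -> 3 dummy variables
    "er_status": 1,    # binary
    "pr_status": 1,    # binary
    "her2_status": 1,  # binary
    "grade": 2,        # 3 levels -> 2 dummy variables
}


def min_events_required(n_parameters: int, events_per_variable: int = 10) -> int:
    """Minimum number of events implied by the events-per-variable rule of thumb."""
    return n_parameters * events_per_variable


n_params = sum(CLINICAL_PARAMS.values())
required = min_events_required(n_params)
print(f"{n_params} parameters -> at least {required} events")
```

Comparing this figure with the observed number of deaths in the training set gives a quick check on whether the unpenalized clinical model is adequately supported.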

**Multiple Testing**:
- For gene expression: Features are selected by regularization (LASSO) rather than per-gene hypothesis tests, so no multiple testing correction is applied
- For clinical variables: No adjustment (hypothesis-generating, not confirmatory)

**Missing Data**:
- Report % missing for all variables
- Clinical: Complete case analysis (assume MCAR)
- Genes: Median imputation (considered justified when < 10% of values are missing; otherwise revisit the approach)

**Significance Level**: α = 0.05 (two-sided)

---

## 7. Performance Metrics

| Metric | Definition | Target |
|--------|------------|--------|
| C-index | Concordance index (discrimination) | > 0.70 |
| Median survival | Time when S(t) = 0.50 | Report with 95% CI |
| 5-year survival | Probability at 60 months | Report with 95% CI |
| Model comparison | ΔC-index (combined or genomic vs. clinical) | ≥ 0.05 considered meaningful |

---

## 8. Software & Reproducibility

**Environment**:
- Python 3.9+
- Key packages: pandas, lifelines, scikit-survival, matplotlib, seaborn
- Random seed: 42 (all analyses)

**Code Structure**:
1. `01_data_loading.ipynb`: Load and initial exploration
2. `02_data_cleaning.ipynb`: Apply inclusion/exclusion, handle missing
3. `03_eda.ipynb`: Descriptive stats, KM curves
4. `04_cox_models.ipynb`: Cox regression analyses
5. `05_genomic_models.ipynb`: LASSO feature selection
6. `06_model_comparison.ipynb`: Performance comparison, visualization

---

## 9. Expected Deliverables

1. **Analysis Code**: Jupyter notebooks with documented workflow
2. **Results Summary**: Markdown file with key findings
3. **Visualizations**:
   - KM curves (overall, stratified)
   - Forest plot of hazard ratios
   - Model comparison (C-index plot)
4. **Model Performance Table**: Comparison of 3 models
5. **GitHub Repository**: Public portfolio project

---

## 10. Limitations (Pre-Specified)

- **Retrospective observational data**: Cannot establish causality
- **Selection bias**: Dataset may not represent all breast cancer patients
- **Generalizability**: Single dataset, may not generalize to other populations
- **Unmeasured confounding**: Clinical variables may not capture all prognostic factors
- **High-dimensional challenge**: The number of genes may approach or exceed the number of patients, so regularization is required
- **No external validation**: Limited to internal train/test split

---

## 11. Interpretation Framework

**What success looks like**:
- Clinical model C-index > 0.65 (reasonable discrimination)
- Genomic features improve C-index by ≥ 0.05
- Selected genes have biological plausibility (ER, HER2, proliferation genes)
- Results align with known breast cancer biology

**Red flags to investigate**:
- C-index < 0.60 (poor discrimination)
- Proportional hazards violations for key variables
- Selected genes have no known cancer association (overfitting)
- Large train/test performance gap (overfitting)

---

## Version History
- v1.0 (2025-01-30): Initial analysis plan
- Updates will be documented with rationale