## CIBMTR Survival Prediction
This notebook demonstrates how to build **machine learning and survival analysis models** to predict patient survival using the CIBMTR dataset.

**Workflow**
1. Data Loading & Preprocessing
2. Exploratory Data Analysis (EDA)
3. Feature Engineering
4. Model Training (Logistic Regression, Random Forest, XGBoost)
5. Evaluation (Accuracy, ROC-AUC, Log-loss)
6. Survival Analysis (Kaplan-Meier, CoxPH)
7. Conclusions

## 1. Setup
  Import libraries and helper functions from `src/`.

In [None]:
import pandas as pd
from src.preprocessing import load_data, preprocess_data
from src.modeling import train_logistic, train_random_forest, train_xgboost
from src.evaluation import evaluate_classifier, plot_feature_importance, survival_analysis, cox_analysis

## 2. Load Dataset
  The CIBMTR dataset should be placed inside the `data/` folder.
  For privacy reasons, raw data is **not included** in the repo.

In [None]:
data_path = "data/cibmtr.csv"  # replace with the actual file name
df = load_data(data_path)
df.head()


## 3. Preprocessing
  - Handle categorical variables
  - Scale numerical features
  - Train/test split

In [None]:
X_train, X_test, y_train, y_test = preprocess_data(df, target="survival")
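
`preprocess_data` lives in `src/preprocessing` and is not shown in this notebook. As a minimal sketch of the three steps listed above, assuming pandas and scikit-learn, a binary target column, and no missing values, it could look roughly like this. The name `preprocess_sketch` and the `test_size` and `random_state` defaults are illustrative assumptions, not the repo's actual implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def preprocess_sketch(df, target, test_size=0.2, random_state=42):
    """Hypothetical stand-in for src.preprocessing.preprocess_data."""
    y = df[target]
    X = df.drop(columns=[target])

    # Separate numeric columns from everything else (treated as categorical).
    num_cols = [c for c in X.columns if pd.api.types.is_numeric_dtype(X[c])]
    cat_cols = [c for c in X.columns if c not in num_cols]

    # One-hot encode the categorical columns; keep numeric columns as floats.
    X = pd.get_dummies(X, columns=cat_cols, dtype=float)
    X[num_cols] = X[num_cols].astype(float)

    # Split before scaling so test-set statistics do not leak into training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    X_train, X_test = X_train.copy(), X_test.copy()

    # Fit the scaler on the training rows only, then apply it to both splits.
    if num_cols:
        scaler = StandardScaler()
        X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
        X_test[num_cols] = scaler.transform(X_test[num_cols])

    return X_train, X_test, y_train, y_test
```

Fitting the scaler on the training split alone is a deliberate choice in this sketch: it keeps the test set untouched by any statistic learned during training.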

## 4. Train Models
  We compare **Logistic Regression, Random Forest, and XGBoost**.

In [None]:
log_reg = train_logistic(X_train, y_train)
rf = train_random_forest(X_train, y_train)
xgb = train_xgboost(X_train, y_train)
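
The `train_*` helpers are defined in `src/modeling` and are not shown in this notebook. As a rough sketch, assuming they are thin wrappers around scikit-learn estimators, two of them might look like the following. The `_sketch` names and all hyperparameters are illustrative assumptions, not the repo's actual settings; an XGBoost helper would presumably follow the same fit-and-return pattern with `xgboost.XGBClassifier`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def train_logistic_sketch(X_train, y_train):
    """Hypothetical stand-in for src.modeling.train_logistic."""
    # A higher max_iter than the default helps convergence on wide,
    # one-hot encoded feature matrices.
    model = LogisticRegression(max_iter=1000)
    return model.fit(X_train, y_train)


def train_random_forest_sketch(X_train, y_train, random_state=42):
    """Hypothetical stand-in for src.modeling.train_random_forest."""
    # A fixed random_state keeps the fitted forest reproducible across runs.
    model = RandomForestClassifier(n_estimators=200, random_state=random_state)
    return model.fit(X_train, y_train)
```

Both functions return the fitted estimator, so the result can be passed straight to an evaluation helper that calls `predict` and `predict_proba`.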

## 5. Evaluation
  We use **accuracy, ROC-AUC, and log-loss** for model comparison.

In [None]:
results = {
    "Logistic Regression": evaluate_classifier(log_reg, X_test, y_test),
    "Random Forest": evaluate_classifier(rf, X_test, y_test),
    "XGBoost": evaluate_classifier(xgb, X_test, y_test)
}
pd.DataFrame(results).T
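
`evaluate_classifier` is defined in `src/evaluation` and is not shown here. A minimal sketch of how the three metrics could be computed, assuming a binary 0/1 target and a scikit-learn style model exposing `predict` and `predict_proba`, is given below. The name `evaluate_sketch` and the dictionary keys are assumptions made for illustration.

```python
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score


def evaluate_sketch(model, X_test, y_test):
    """Hypothetical stand-in for src.evaluation.evaluate_classifier."""
    y_pred = model.predict(X_test)               # hard class labels
    y_prob = model.predict_proba(X_test)[:, 1]   # predicted P(class == 1)
    return {
        # Fraction of test rows whose predicted label is correct.
        "accuracy": accuracy_score(y_test, y_pred),
        # Ranking quality of the predicted probabilities (0.5 = chance).
        "roc_auc": roc_auc_score(y_test, y_prob),
        # Penalizes confident wrong probabilities; lower is better.
        "log_loss": log_loss(y_test, y_prob),
    }
```

Returning one dictionary per model is what makes the `pd.DataFrame(results).T` call useful: each model becomes a row and each metric a column.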


### Feature Importance (Random Forest)


In [None]:
plot_feature_importance(rf, X_train.columns)

## 6. Survival Analysis
We estimate survival curves with **Kaplan-Meier** and model covariate effects with **Cox Proportional Hazards**.


In [None]:
# Example survival analysis (requires time-to-event + event indicator columns)
survival_analysis(df, time_col="time_to_event", event_col="event")

# Cox Proportional Hazards
cox_analysis(df, time_col="time_to_event", event_col="event")
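
`survival_analysis` is defined in `src/evaluation` and is not shown here. For reference, the Kaplan-Meier (product-limit) estimator multiplies, at each distinct event time, the fraction of at-risk patients who survive that time. The sketch below implements that definition in plain Python; the function name is hypothetical, and it assumes the event indicator is coded 1 for an observed event and 0 for a censored observation.

```python
def kaplan_meier_sketch(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  -- follow-up duration per patient
    events -- 1 if the event was observed, 0 if censored (assumed coding)
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Events and censored patients both leave the risk set after t.
        n_at_risk -= leaving
    return curve


curve = kaplan_meier_sketch([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print([(t, round(s, 3)) for t, s in curve])
# prints [(1, 0.8), (2, 0.6), (3, 0.3)]
```

Censored patients never lower the curve directly; they only shrink the risk set for later event times, which is why the final censored observation at time 4 adds no point.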


## 7. Conclusions
- Random Forest and XGBoost provide strong predictive performance on the held-out test set (see the metrics table in Section 5).
- Kaplan-Meier curves give a clear view of survival probabilities over time.
- CoxPH makes covariate effects interpretable through hazard ratios.
