# 04 – Survival Analysis

In this notebook, we will:

- Understand what survival analysis is and when to use it
- Explore censoring and time-to-event data
- Use Kaplan–Meier plots and Cox proportional hazards regression
- Run survival models on synthetic public health data


## ⏳ What is Survival Analysis?

Survival analysis is used when the outcome is the **time until an event occurs** (e.g., death, diagnosis, dropout).

Key features:
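- **Censoring**: For some participants the event is not observed during follow-up (for example, the study ends or they drop out first). We only know that they were event-free up to their last observation.
- **Survival function** *S(t)*: Probability of remaining event-free beyond time *t*.
- **Hazard function** *h(t)*: Instantaneous event rate at time *t*, given survival up to *t*.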

We’ll explore both non-parametric (Kaplan–Meier) and semi-parametric (Cox) models.
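
To make censoring concrete, the sketch below computes the Kaplan–Meier estimate by hand for a small made-up sample (the durations and event flags are illustrative and not taken from the course dataset). At each distinct event time the running estimate is multiplied by `1 - d / n`, where `d` is the number of events at that time and `n` is the number of participants still at risk. Censored observations leave the risk set without causing a drop.

```python
import numpy as np

# Hypothetical toy data: follow-up time and event flag (1 = event observed, 0 = censored)
durations = np.array([2.0, 3.0, 3.0, 5.0, 7.0, 8.0])
observed = np.array([1, 1, 0, 1, 0, 1])

def kaplan_meier(durations, observed):
    """Return (event_times, survival_probabilities) for right-censored data."""
    event_times = np.unique(durations[observed == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                     # under observation just before t
        events = np.sum((durations == t) & (observed == 1))  # events occurring exactly at t
        s *= 1.0 - events / at_risk
        survival.append(s)
    return event_times, np.array(survival)

km_times, km_survival = kaplan_meier(durations, observed)
for t, s in zip(km_times, km_survival):
    print(f"t = {t:.0f}: S(t) = {s:.3f}")
```

The participant censored at *t* = 3 does not lower the curve, but is no longer counted as at risk at *t* = 5. This is how censored observations still contribute information.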


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, CoxPHFitter

# Load dataset
df = pd.read_csv('https://raw.githubusercontent.com/ggkuhnle/FB2NEP_datascience/main/data/fb2nep_data.csv')

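# Add synthetic survival data (for demonstration only)
np.random.seed(11088)  # fixed seed so the results are reproducible
df['time'] = np.random.exponential(scale=10, size=len(df)).round(1)  # follow-up time, mean of 10 time units
df['event'] = np.random.binomial(1, 0.7, size=len(df))  # 1 = event observed, 0 = censored (about 30%)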
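
The cell above draws the event indicator independently of the follow-up time, which is enough for a demonstration. In a real study, censoring usually arises because follow-up ends before the event happens. A minimal sketch of that mechanism follows, assuming exponential event times and a fixed end of follow-up. The names `true_event_time`, `study_end`, `sim_time` and `sim_event`, and the value 15, are illustrative assumptions and not part of the course data.

```python
import numpy as np

rng = np.random.default_rng(11088)
n = 1000

# Latent time to event for each participant (never fully observed in practice)
true_event_time = rng.exponential(scale=10, size=n)

# Hypothetical administrative censoring: follow-up stops at a fixed time
study_end = 15.0

# What is actually recorded: the earlier of event time and end of follow-up
sim_time = np.minimum(true_event_time, study_end)
sim_event = (true_event_time <= study_end).astype(int)  # 1 = event observed, 0 = censored

print(f"Proportion with observed event: {sim_event.mean():.2f}")
```

Here every censored participant has a recorded time equal to `study_end`, so censoring depends on how long the study runs and not on a coin flip.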

## 📉 Kaplan–Meier Estimator

In [None]:
kmf = KaplanMeierFitter()
kmf.fit(df['time'], event_observed=df['event'])

# Plot the estimated survival function with its 95% confidence band
plt.figure(figsize=(8, 5))
kmf.plot_survival_function()
plt.title('Kaplan–Meier Survival Curve')
plt.xlabel('Time')
plt.ylabel('Survival Probability')
plt.grid(True)
plt.show()

### Compare survival curves by sex

In [None]:
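fig, ax = plt.subplots(figsize=(8, 5))
# Fit and draw one Kaplan–Meier curve per group on the same axes
for label, grouped_df in df.groupby('sex'):
    kmf.fit(grouped_df['time'], event_observed=grouped_df['event'], label=str(label))
    kmf.plot_survival_function(ax=ax)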
plt.title('Survival by Sex')
plt.xlabel('Time')
plt.ylabel('Survival Probability')
plt.grid(True)
plt.show()
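
Comparing the curves by eye does not tell us whether a difference is larger than expected by chance. The log-rank test is the usual companion to a two-group Kaplan–Meier plot. The sketch below implements the two-group test with NumPy on simulated data, so the groups, sample sizes and the helper name `logrank_two_groups` are assumptions for illustration. In practice, `logrank_test` from `lifelines.statistics` is the more convenient choice.

```python
import math
import numpy as np

def logrank_two_groups(time_a, event_a, time_b, event_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    times = np.concatenate([time_a, time_b])
    events = np.concatenate([event_a, event_b])
    in_a = np.concatenate([np.ones(len(time_a), dtype=bool),
                           np.zeros(len(time_b), dtype=bool)])

    observed_minus_expected = 0.0
    variance = 0.0
    for t in np.unique(times[events == 1]):
        at_risk = times >= t
        n = at_risk.sum()                                   # at risk just before t, both groups
        n_a = (at_risk & in_a).sum()                        # at risk in group A
        d = ((times == t) & (events == 1)).sum()            # events at t, both groups
        d_a = ((times == t) & (events == 1) & in_a).sum()   # events at t in group A
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of events in group A
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    chi2 = observed_minus_expected ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value

# Simulated example: group B has a longer mean time to event than group A
rng = np.random.default_rng(0)
time_a = rng.exponential(scale=5, size=200)
time_b = rng.exponential(scale=10, size=200)
event_a = np.ones(200, dtype=int)  # no censoring, to keep the example short
event_b = np.ones(200, dtype=int)

chi2, p_value = logrank_two_groups(time_a, event_a, time_b, event_b)
print(f"chi2 = {chi2:.1f}, p = {p_value:.2g}")
```

With the synthetic `time` and `event` columns used in this notebook, survival does not depend on sex by construction, so a small p-value there would be a chance finding.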

## 🔧 Cox Proportional Hazards Model

In [None]:
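cph = CoxPHFitter()
# Keep only the model variables; lifelines does not accept missing values
cph_df = df[['time', 'event', 'age', 'bmi', 'sex']].dropna().copy()
# Dummy-code sex (the first level is the reference); dtype=float keeps the columns numeric
cph_df = pd.get_dummies(cph_df, columns=['sex'], drop_first=True, dtype=float)

cph.fit(cph_df, duration_col='time', event_col='event')
cph.print_summary()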

## 📘 Interpretation

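- **Hazard ratio (HR)**: reported as `exp(coef)` in the summary. HR > 1 means a higher hazard and HR < 1 a lower hazard, per one-unit increase in the covariate (or relative to the reference category).
- Check the confidence intervals and p-values: an interval that includes 1 is compatible with no association.
- The Cox model assumes proportional hazards, i.e. that each hazard ratio is constant over time.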

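Explore the estimated effects of age, sex, and BMI. Because `time` and `event` were generated at random, independently of these covariates, expect hazard ratios close to 1.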
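
A hazard ratio is the exponential of a Cox coefficient, and its 95% confidence interval is obtained by exponentiating the interval for the coefficient. The numbers below are made up purely to show the arithmetic. In lifelines, the fitted values appear in the `coef`, `se(coef)` and `exp(coef)` columns of `cph.summary`.

```python
import numpy as np

# Hypothetical coefficient and standard error for a covariate measured in years
coef = 0.03
se = 0.01

hazard_ratio = np.exp(coef)
ci_lower = np.exp(coef - 1.96 * se)
ci_upper = np.exp(coef + 1.96 * se)
print(f"HR per year: {hazard_ratio:.3f} (95% CI {ci_lower:.3f} to {ci_upper:.3f})")

# The effect is multiplicative, so a 10-year difference corresponds to HR ** 10
hr_10_years = np.exp(10 * coef)
print(f"HR per 10 years: {hr_10_years:.3f}")
```

A hazard ratio that looks negligible per unit can be substantial over a realistic range of the covariate, so it helps to state the unit when reporting results.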


## 🧠 Exercise

1. Add a new variable (e.g. smoker) to the Cox model and test whether it is associated with survival.
2. Stratify the Kaplan–Meier curves by BMI category.
3. Add further variables to your Cox model and check whether it still converges and how the hazard ratios change.
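
As a starting point for exercise 2, a continuous variable has to be binned before it can define Kaplan–Meier groups. One possible approach uses `pd.cut`. The cut-offs below follow the common WHO adult BMI categories, and the small data frame `demo` is a stand-in for the course data, so adjust both as needed.

```python
import pandas as pd

# Stand-in values; in the notebook this would be df['bmi']
demo = pd.DataFrame({'bmi': [17.9, 22.4, 27.1, 31.5, 24.9, 30.0]})

demo['bmi_cat'] = pd.cut(
    demo['bmi'],
    bins=[0, 18.5, 25, 30, float('inf')],
    labels=['underweight', 'normal', 'overweight', 'obese'],
    right=False,  # intervals are [lower, upper), so 30.0 counts as obese
)
print(demo)
```

The new column can be passed to `groupby` in the same way as `sex`, ideally with `observed=True` so that empty categories are skipped.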

✍️ Add your reflections below.


## 🧪 Playground – experiment here

In [None]:
# Your code here