In [None]:
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.plotting import add_at_risk_counts
from lifelines.statistics import logrank_test
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as pp
import pandas as pd

## Load data

In [None]:
file_name = "../data/Survival Dataset.xlsx"
df = pd.read_excel(file_name)
df["Sex"] = df["Sex"].astype(CategoricalDtype(categories=["Female", "Male"], ordered=False))
df["P53 Bin"] = df["P53 Bin"].astype(
    CategoricalDtype(categories=["WT", "Mutated"], ordered=True)
)
df.head()

In [None]:
df.columns

## Univariable survival

Let's start with a simple plot of overall survival across the cohort.

The first step is to get our time and event variables.
We have to be careful about how we encode the censoring event.
The [lifelines](https://lifelines.readthedocs.io/en/latest) package we are using expects an indicator of observed events (1 = event observed, 0 = censored).
The dataset stores the opposite (a censoring indicator), so we subtract the censor column from 1 to flip it.

In [None]:
# Get our time and censoring variables
time = df["Survival Time"]
# Lifelines encodes observed events as ones, which is the opposite of the dataset
event = 1 - df["Disease Specific Censor"]
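
As a quick sanity check on the recoding, here is a minimal sketch with made-up censor flags (not values from the dataset). It assumes, as described above, that the dataset's column uses 1 for censored and 0 for an observed event.

```python
# Hypothetical censor flags in the dataset's convention (assumed):
# 1 = censored, 0 = event observed
censor = [0, 1, 0, 0, 1]

# Flip to the lifelines convention: 1 = event observed, 0 = censored
event_observed = [1 - c for c in censor]

# Count how many events were actually observed
n_events = sum(event_observed)
print(event_observed, n_events)
```

If the number of observed events looks implausible for the cohort, the coding is probably inverted.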

Now that our data is set up, we can fit the model.

In [None]:
# Fit the KM curve
kmf = KaplanMeierFitter()
kmf.fit(time, event_observed=event)
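
To see roughly what the fitter estimates, here is a minimal pure-Python sketch of the Kaplan-Meier (product-limit) estimate on made-up data. The function name `kaplan_meier` and the `toy_*` variables are hypothetical, and this is an illustration of the idea rather than how lifelines implements it.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    survival = 1.0
    curve = []
    # Only times where at least one event was observed change the estimate
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        # Subjects still under observation just before time t
        at_risk = sum(1 for x in times if x >= t)
        # Events observed exactly at time t
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Made-up data: 1 = event observed, 0 = censored
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
for t, s in toy_curve:
    print(f"t={t}: S(t)={s:.2f}")
```

Censored subjects never trigger a drop in the curve, but they still count in the at-risk denominator until their follow-up ends.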

Next we can plot the results.
Remember this will be for everyone in the cohort.

In [None]:
kmf.plot_survival_function()

We can add the risk table as well.

In [None]:
kmf.plot_survival_function(at_risk_counts=True)

We can also get rid of the confidence intervals if we want.

In [None]:
kmf.plot_survival_function(at_risk_counts=True, ci_show=False)
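
For intuition about the risk table, the sketch below counts how many subjects are still being followed at a few time points, using made-up follow-up times. The helper `n_at_risk` is hypothetical, and the exact convention lifelines uses at tie boundaries may differ from this simple definition.

```python
def n_at_risk(times, t):
    """Number of subjects whose follow-up time is at least t."""
    return sum(1 for x in times if x >= t)


# Made-up follow-up times
toy_times = [1, 2, 2, 3, 4]
toy_risk_table = {t: n_at_risk(toy_times, t) for t in (0, 2, 4)}
print(toy_risk_table)
```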

### Comparing groups

Now let's compare survival by sex.

To do this we will fit on the male and female data separately and plot each curve.
We'll start by finding the male observations.

In [None]:
male_idx = df["Sex"] == "Male"
male_idx

Plotting just the males.

In [None]:
kmf.fit(time[male_idx], event_observed=event[male_idx])
kmf.plot_survival_function()

To get the female index, we can negate the male index.
On a boolean pandas Series the `~` operator does an element-wise boolean not, i.e. it flips True->False and False->True.

>You could also do `df["Sex"] == "Female"`.

In [None]:
~male_idx

In [None]:
kmf.fit(time[~male_idx], event_observed=event[~male_idx])
kmf.plot_survival_function()

Now doing both together.

In [None]:
kmf.fit(time[male_idx], event_observed=event[male_idx], label="Male")
kmf.plot_survival_function()

kmf.fit(time[~male_idx], event_observed=event[~male_idx], label="Female")
kmf.plot_survival_function()

Getting the risk table takes a bit more work.
We need to create a `Figure`, then add an `Axes` to be shared by the plots.

In [None]:
# Plotting area
fig = pp.figure()
ax = fig.add_subplot(1, 1, 1)
# Male curve
kmf_m = KaplanMeierFitter()
kmf_m.fit(time[male_idx], event_observed=event[male_idx], label="Male")
kmf_m.plot_survival_function(ax=ax)
# Female curve
kmf_f = KaplanMeierFitter()
kmf_f.fit(time[~male_idx], event_observed=event[~male_idx], label="Female")
kmf_f.plot_survival_function(ax=ax)
# Set the x axis label
ax.set_xlabel("Survival time")
# Add the risk table
add_at_risk_counts(kmf_m, kmf_f, ax=ax)
# Fixes up the spacing
fig.tight_layout()

Let's use the p53 column.
We will use a `for` loop to reduce the redundancy in the code.

In [None]:
# Plotting area
fig = pp.figure()
ax = fig.add_subplot(1, 1, 1)
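# Keep every fitted model so the risk table covers all groups
fitters = []
# dropna() skips rows with a missing p53 status
for val in df["P53 Bin"].dropna().unique():
    group_idx = df["P53 Bin"] == val
    time = df.loc[group_idx, "Survival Time"]
    # Flip the censor flag so that 1 means the event was observed
    event = 1 - df.loc[group_idx, "Disease Specific Censor"]
    kmf = KaplanMeierFitter()
    kmf.fit(time, event_observed=event, label=val)
    kmf.plot_survival_function(ax=ax)
    fitters.append(kmf)
# Set the x axis label
ax.set_xlabel("Survival time")
# Add the risk table
add_at_risk_counts(*fitters, ax=ax)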
# Fixes up the spacing
fig.tight_layout()

### Significance testing

We can use the log-rank test to check whether the difference between groups is significant.
We'll need to set up the data for this again.

In [None]:
time = df["Survival Time"]
event = 1 - df["Disease Specific Censor"]
wt_idx = df["P53 Bin"] == "WT"

In [None]:
results = logrank_test(
    time[wt_idx],
    time[~wt_idx],
    event_observed_A=event[wt_idx],
    event_observed_B=event[~wt_idx],
)
results.print_summary()

In [None]:
results.p_value
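
To make the test less of a black box, here is a minimal pure-Python sketch of the standard unweighted two-group log-rank statistic on made-up data. The function name `logrank` and the `toy_*` variables are hypothetical. It is a teaching aid under the usual textbook formulation, not a replacement for `logrank_test`.

```python
import math


def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers at risk in each group just before time t
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        # Events observed at time t
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e == 1)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        # Observed minus expected events in group A under the null hypothesis
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    stat = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up data where group A fails earlier than group B (all events observed)
toy_stat, toy_p = logrank([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(f"statistic={toy_stat:.2f}, p={toy_p:.3f}")
```

The statistic compares the observed events in one group with the number expected if both groups shared the same hazard, so a large value (small p-value) suggests the survival curves differ.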

## Multivariable analysis

We can also do a Cox proportional hazards analysis.
For this we pass our DataFrame in directly and name the duration and event columns.
The event column therefore has to live in the DataFrame with the coding lifelines expects, so we add a recoded copy.

In [None]:
df["event"] = 1 - df["Disease Specific Censor"]

Now we can fit a Cox model.
We will use sex and p53 status first.

> The p53 column needs some extra work because its name contains white space.
> Specifically, we wrap the formula in single quotes and use `Q("P53 Bin")` to quote the column name.

In [None]:
cph = CoxPHFitter()
cph.fit(df, duration_col="Survival Time", event_col="event", formula='Sex + Q("P53 Bin")')
cph.print_summary()

So it seems sex and p53 status are both significant under this model.
Let's try a simpler model with just p53 status and see if it is a better fit.

In [None]:
cph = CoxPHFitter()
cph.fit(df, duration_col="Survival Time", event_col="event", formula='Q("P53 Bin")')
cph.print_summary()

The partial AIC indicates this is not a better model.
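
As a reminder of how this comparison works, the partial AIC is twice the number of parameters minus twice the partial log-likelihood, and lower values indicate a better trade-off between fit and complexity. The sketch below uses invented log-likelihood values (not output from the models above), and the helper `partial_aic` is hypothetical. In lifelines the fitted value should be available as `cph.AIC_partial_`.

```python
def partial_aic(partial_log_likelihood, n_params):
    """AIC computed from the Cox partial log-likelihood."""
    return 2 * n_params - 2 * partial_log_likelihood


# Invented partial log-likelihoods for two nested models
ll_two_covariates = -250.0
ll_one_covariate = -254.0

aic_two = partial_aic(ll_two_covariates, n_params=2)
aic_one = partial_aic(ll_one_covariate, n_params=1)
print(aic_two, aic_one)

# The model with the lower partial AIC is preferred
preferred = "two covariates" if aic_two < aic_one else "one covariate"
print(preferred)
```

In this invented case the extra covariate improves the log-likelihood by more than its one-parameter penalty, so the larger model is preferred.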

Let's try adding an interaction term.

In [None]:
cph = CoxPHFitter()
# `A * B` in a formula expands to both main effects plus their interaction
cph.fit(df, duration_col="Survival Time", event_col="event", formula='Sex * Q("P53 Bin")')
cph.print_summary()

Adding an interaction does not improve the model fit either.

Let's go back to our best model and visualize it with a forest plot.

In [None]:
cph = CoxPHFitter()
cph.fit(df, duration_col="Survival Time", event_col="event", formula='Sex + Q("P53 Bin")')
cph.plot()

By default lifelines plots the log hazard ratios.
To align with our notes we can show the hazard ratios instead.

In [None]:
cph.plot(hazard_ratios=True)
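
The two scales are related by exponentiation: a hazard ratio is `exp` of the log hazard ratio, and the same transform applies to the confidence limits. The sketch below uses an invented coefficient and standard error (not values from the fitted model) and a normal-approximation 95% interval.

```python
import math

# Invented log hazard ratio and standard error, for illustration only
log_hr = 0.5
se = 0.2
z = 1.96  # approximate 97.5th percentile of the standard normal

# Exponentiate to move from the log scale to the hazard ratio scale
hazard_ratio = math.exp(log_hr)
ci_lower = math.exp(log_hr - z * se)
ci_upper = math.exp(log_hr + z * se)
print(f"HR={hazard_ratio:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})")
```

On the hazard ratio scale the reference value for "no effect" is 1 rather than 0, and the interval is no longer symmetric around the estimate.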

Let's include a continuous variable, age, in the model.

In [None]:
cph = CoxPHFitter()
cph.fit(df, duration_col="Survival Time", event_col="event", formula='Sex + Q("P53 Bin") + Q("Age at Surgery")')
cph.plot(hazard_ratios=True)

The plot is a bit ugly, but it is just a matplotlib `Axes` so we can make some changes.

Let's start by fixing the y-axis labels.

In [None]:
ax = cph.plot(hazard_ratios=True)
# Note we start from the bottom up in the labelling
ax.set_yticklabels(["Age at surgery", "Male", "P53 Mut"])

- TODO: Manual plotting