# Module 2 Practice 1 Answers - Performing a Survival Analysis
In this practice exercise, you will perform the steps of a survival analysis on a new dataset.

The dataset we will be using is another leukaemia survival dataset. In this data, the groups are denoted by a test result (`ag`, recorded as `present` or `absent`). There is no censored data in this dataset: all patients died from acute myelogenous leukaemia, and their deaths were recorded during the study. There is also an independent variable, `wbc`, the white blood cell count.

Documentation for the dataset is [here](../resources/leukaemia-wbc.html).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
!{sys.executable} -m pip install lifelines
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
from lifelines import CoxPHFitter

In [None]:
data = pd.read_csv('../resources/leukaemia-wbc.csv', index_col=0)

display(data)

## Create a column to represent the event
This will be a constant value since every patient died, but is needed for the `lifelines` methods.

In [None]:
data['death'] = 1

## Create a numeric category for the two groups
Create a numeric column that can be used in regression to represent the two groups in the data.

In [None]:
# create a binary numeric variable for the two groups
# 0 = absent
# 1 = present
data['group'] = (data['ag'] == 'present').astype(int)


## Calculate the event tables for each group

In [None]:
# First we will break the data into two views for convenience
data_present = data[data['ag'] == 'present']
data_absent = data[data['ag'] == 'absent']

# create two Kaplan-Meier fitters
kmf_present = KaplanMeierFitter()
kmf_absent = KaplanMeierFitter()

kmf_present.fit(durations = data_present['time'], event_observed = data_present['death'], label='present')
kmf_absent.fit(durations = data_absent['time'], event_observed = data_absent['death'], label='absent')

event_table_present = kmf_present.event_table
event_table_absent = kmf_absent.event_table

## Add the cumulative survival probability to each event table and print them

In [None]:
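# survival_function_ is a one-column DataFrame named after the fitter's label,
# so select that column explicitly before assigning it
event_table_present['cumulative_S'] = kmf_present.survival_function_['present']
event_table_absent['cumulative_S'] = kmf_absent.survival_function_['absent']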

display(event_table_present)
display(event_table_absent)
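
To make the cumulative survival column less of a black box, here is a minimal sketch of the Kaplan-Meier product computed by hand. It assumes, as in this dataset, that every subject has an observed event (no censoring). The function name `kaplan_meier` and the `toy_weeks` values are hypothetical and are not taken from the leukaemia data.

```python
from collections import Counter


def kaplan_meier(durations):
    """Return (time, at_risk, deaths, survival) rows.

    Assumes every duration ends in an observed event (no censoring).
    """
    deaths_at = Counter(durations)
    at_risk = len(durations)
    survival = 1.0
    rows = []
    for t in sorted(deaths_at):
        d = deaths_at[t]
        # multiply by the conditional probability of surviving past time t
        survival *= 1 - d / at_risk
        rows.append((t, at_risk, d, survival))
        # subjects who died at t are no longer at risk afterwards
        at_risk -= d
    return rows


toy_weeks = [2, 3, 3, 5, 8]  # hypothetical survival times in weeks
for t, n, d, s in kaplan_meier(toy_weeks):
    print(f"t={t:>2}  at_risk={n}  deaths={d}  S(t)={s:.2f}")
```

Each row multiplies the running survival probability by `1 - deaths / at_risk`, which is the quantity that `lifelines` reports in `survival_function_` when no observations are censored.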

## Plot the Kaplan-Meier curve for each group

In [None]:
kmf_present.plot()
kmf_absent.plot()

_ = plt.xlabel('Weeks')
_ = plt.ylabel('Probability of Survival')

## Test the hypothesis that the survival distribution is different for the two groups

In [None]:
results = logrank_test(
    data_present['time'],
    data_absent['time'],
    event_observed_A=data_present['death'],
    event_observed_B=data_absent['death'],
)

results.print_summary()
print(results.p_value)
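
For intuition about what the log-rank test computes, the sketch below builds the statistic by hand for the no-censoring case: at each distinct event time it compares the observed deaths in one group with the number expected if both groups shared the same hazard. This is an illustrative hand-rolled version, not the `lifelines` implementation, and the function name `logrank_no_censoring` and the toy times are hypothetical.

```python
import math


def logrank_no_censoring(times_a, times_b):
    """Two-group log-rank test, assuming every subject has an observed event.

    Returns (chi-square statistic with 1 degree of freedom, p-value).
    """
    times_a, times_b = list(times_a), list(times_b)
    observed_a = expected_a = variance = 0.0
    for t in sorted(set(times_a) | set(times_b)):
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        n = n_a + n_b
        d_a = times_a.count(t)                   # deaths in group A at t
        d = d_a + times_b.count(t)               # deaths in both groups at t
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # hypergeometric variance of the group A death count at time t
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    statistic = (observed_a - expected_a) ** 2 / variance
    # upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# hypothetical survival times (weeks) for two small groups
stat, p = logrank_no_censoring([1, 2, 3, 4], [5, 6, 7, 8])
print(f"statistic={stat:.3f}, p-value={p:.4f}")
```

A small p-value means the observed death counts differ from the expected counts by more than chance would plausibly explain under the null hypothesis of identical survival distributions.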

## Interpret the results using appropriate language

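The log-rank test rejects the null hypothesis that the two groups share the same survival distribution (the p-value printed above falls below the conventional 0.05 significance level). The data therefore support the hypothesis that the survival distribution is different for the two groups.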

## Find the hazard ratio between the two groups

In [None]:
cph = CoxPHFitter()
cph.fit(data[['time', 'death', 'group']], duration_col='time', event_col='death')

cph.print_summary()

## Interpret the hazard ratio in written form

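Subjects with a test result of present (group = 1) have a hazard of death 0.31 times that of subjects with a test result of absent (group = 0), which is a reduction of about 69%. Equivalently, being in the absent group multiplies the hazard of death by about 1/0.31 ≈ 3.2 compared with the present group.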
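
The direction of a hazard ratio is easy to get backwards, so the arithmetic is sketched below. It assumes that 0.31 is the `exp(coef)` value reported for `group` in the Cox model summary, which makes it the hazard of group = 1 (present) relative to group = 0 (absent). The variable names are hypothetical.

```python
import math

# assumed to be exp(coef) for `group`: hazard of present relative to absent
hazard_ratio_present_vs_absent = 0.31

# the Cox coefficient is the natural log of the hazard ratio
coef = math.log(hazard_ratio_present_vs_absent)

# swapping the reference group inverts the ratio
hazard_ratio_absent_vs_present = 1 / hazard_ratio_present_vs_absent

# percentage reduction in hazard for present compared with absent
percent_lower = (1 - hazard_ratio_present_vs_absent) * 100

print(f"coef = {coef:.2f}")
print(f"absent vs present = {hazard_ratio_absent_vs_present:.2f}")
print(f"present hazard is {percent_lower:.0f}% lower")
```

A ratio below 1 means the coded group (group = 1) has the lower hazard. Taking the reciprocal expresses the same result from the point of view of the reference group.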