SURVIVAL ANALYSIS 


Survival analysis was originally developed to quantify and measure the lifespans of individuals (how likely someone was to die or survive). For example, if you wanted to know the probability of death over time for a certain group of people who had contracted some illness, you would use survival analysis. However, the techniques used in survival analysis can be applied to a wide variety of situations.


There are certain requirements for conducting a survival analysis. The first of these is that you need to have a field in your data that represents the occurrence (or lack thereof) of an event.

The second requirement is that you need to have a field that represents the passage of time in some fashion. This can be datetimes or it can be a number that increases with the passage of time, such as number of days/weeks/months/years, the number of uses, distance traveled, etc.
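
If the time information is stored as datetimes rather than as a running count, it can be converted into a duration first. The following is a minimal sketch, assuming hypothetical `start_date` and `end_date` columns (they are not part of the churn data used below) and assuming that a missing end date means the event has not happened yet:

```python
import pandas as pd

# Hypothetical example data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2020-01-01", "2020-03-15", "2020-06-01"]),
    "end_date": pd.to_datetime(["2020-04-01", None, "2020-06-30"]),
})

# Rows without an end date are still active, so measure them up to a
# chosen observation cutoff instead (the cutoff date is an assumption).
cutoff = pd.Timestamp("2020-12-31")
observed_end = df["end_date"].fillna(cutoff)

# Duration in whole days between the start and the (observed) end.
df["duration_days"] = (observed_end - df["start_date"]).dt.days

# Event flag: 1 if an end date was actually recorded, 0 otherwise.
df["event"] = df["end_date"].notna().astype(int)

print(df)
```

The resulting `duration_days` and `event` columns play the same roles as the time and event fields described above.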

In [7]:
import pandas as pd 
import chart_studio.plotly as py
import cufflinks as cf 

In [19]:
cf.go_offline()

The third and final requirement is that you need to have at least one field to group by. Survival analysis calculates the survival rates for groups that have something in common, and the group-by field is the thing they have in common. If you have multiple fields you can group by, you can perform the analysis for the groups represented by each field individually, or you can combine them into more granular segments and examine the survival rates of those. However, you can only do this up to the point where the number of examples in a group becomes too small to reliably calculate its survival rate. So you need to balance the ability to drill down into granular details with the need to have groups large enough to calculate meaningful and reliable survival rates.
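
One way to check that balance before fitting anything is to count how many rows fall into each candidate group. This is a rough sketch on toy data. The column names mirror the churn data, but the rows and the `MIN_GROUP_SIZE` threshold are made up for illustration:

```python
import pandas as pd

# Toy data; in the notebook this would be the churn DataFrame.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "SeniorCitizen": [0, 0, 1, 0, 0, 1],
})

# Size of each group when grouping by a single field.
single_counts = toy.groupby("gender").size()

# Size of each segment when two fields are combined.
combined_counts = toy.groupby(["gender", "SeniorCitizen"]).size()

# Flag segments below a hypothetical minimum size (the threshold is an assumption).
MIN_GROUP_SIZE = 2
too_small = combined_counts[combined_counts < MIN_GROUP_SIZE]

print(single_counts)
print(combined_counts)
print(too_small)
```

Segments that show up in `too_small` are candidates for merging with a neighbouring segment or for analysis at a coarser level.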

In [32]:
data = pd.read_csv(r'C:\Users\Yael Aguilar\Documents\clasesdata_102020\data\churn.csv')
data.head()


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,MonthLevel,TotalLevel,TenureLevel,ChurnBinary
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,Month-to-month,Yes,Electronic check,29.85,29.85,No,Low,Very Low,New,0.0
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,One year,No,Mailed check,56.95,1889.5,No,Low,Moderate,Loyal,0.0
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,Month-to-month,Yes,Mailed check,53.85,108.15,Yes,Low,Very Low,New,1.0
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,One year,No,Bank transfer (automatic),42.3,1840.75,No,Low,Moderate,Loyal,0.0
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,Month-to-month,Yes,Electronic check,70.7,151.65,Yes,Moderate,Very Low,New,1.0


In [22]:
data['ChurnBinary'].unique()

array([0., 1.])
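
The data already contains a numeric event column, `ChurnBinary`, which in the rows shown above lines up with the text column `Churn`. If only a text column were available, a flag like this could be derived with a simple mapping. A sketch on toy data, assuming the column contains only "Yes" and "No":

```python
import pandas as pd

# Toy column mirroring the Churn values shown in the preview above.
toy = pd.DataFrame({"Churn": ["No", "No", "Yes", "No", "Yes"]})

# Map the text labels to 1.0 (event occurred) and 0.0 (event not observed).
toy["ChurnBinary"] = toy["Churn"].map({"Yes": 1.0, "No": 0.0})

print(toy)
```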

Now that we have identified the data we want to use and the corresponding fields, we can begin our analysis. The objective is to get a sense of how customer retention rates decline as tenure increases among the different groups of customers in our data set.

The easiest way to do this in Python is with the lifelines library. We are going to import the KaplanMeierFitter from lifelines. Kaplan-Meier curves are one of the most common statistical techniques used in survival analysis, and they will help us estimate the survival function of the different cohorts we designate (e.g. males vs. females, senior citizens vs. non-seniors, etc.).
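
To make the estimate less of a black box, here is a small hand-rolled sketch of the Kaplan-Meier product formula, S(t) = product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i is the number of subjects still at risk just before t_i. It is an illustration on made-up numbers, not the lifelines implementation, and it only records the times at which an event occurs:

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] for each time with at least one event."""
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        # n_i: subjects whose duration is at least t are still at risk at t.
        at_risk = sum(1 for d in durations if d >= t)
        # d_i: events observed exactly at time t (censored rows have event == 0).
        n_events = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        if n_events > 0:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
    return curve


# Made-up example: five subjects, two of them censored (event == 0).
durations = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]

curve = kaplan_meier(durations, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects (event 0) never reduce the survival probability directly, but they do leave the at-risk pool, which is why the drop at time 3 is larger than the earlier ones.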

In [23]:

from lifelines import KaplanMeierFitter
from lifelines.datasets import load_waltons

From there, we are going to define a survival function that accepts a data frame along with each of the three necessary components for conducting a survival analysis (a group field, a time field, and an event field). Within the function, we define our model and then create an empty list in which to store our results. We then loop through each of the unique values in the group field, grab the data in the time and event fields for that group, fit our model to the data, and append the resulting survival function to our results list. At the end, we concatenate the results for each unique group field value and return them.

In [24]:
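def survival(data, group_field, time_field, event_field):
    """Fit one Kaplan-Meier curve per group and return them side by side."""
    model = KaplanMeierFitter()
    results = []

    for value in data[group_field].unique():
        # Keep only the rows that belong to the current group.
        group = data[data[group_field] == value]
        T = group[time_field]    # durations (e.g. tenure in months)
        E = group[event_field]   # event flags (1 = churned, 0 = not observed)
        # The label becomes the column name of this group's curve.
        model.fit(T, E, label=str(value))
        results.append(model.survival_function_)

    # One column per group, indexed by time.
    return pd.concat(results, axis=1)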
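
One thing to be aware of when the per-group curves are concatenated column-wise: if the groups were observed at different sets of times, pandas aligns them on the union of the indexes and leaves NaN where a group has no row. Because a survival curve is a step function, carrying the last known value forward is one reasonable way to fill those gaps before plotting. A sketch with two made-up curves (the values are illustrative only):

```python
import pandas as pd

# Two toy step curves observed at different times (values are made up).
a = pd.DataFrame({"A": [1.0, 0.8, 0.5]}, index=[0, 1, 3])
b = pd.DataFrame({"B": [1.0, 0.9, 0.6]}, index=[0, 2, 3])

# Column-wise concat aligns on the union of both indexes and inserts NaN
# wherever a curve has no row for a given time.
combined = pd.concat([a, b], axis=1)

# Sort by time, then carry the last known value forward (step function).
filled = combined.sort_index().ffill()

print(filled)
```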

In [30]:
# Group by gender so the curves match the plot title.
rates = survival(data, 'gender', 'tenure', 'ChurnBinary')
rates.iplot(kind='line', xTitle='Tenure (Months)', yTitle='Retention Rate', title='Retention Rates by Tenure and Gender')

In [33]:
# Group by senior-citizen status so the curves match the plot title.
rates = survival(data, 'SeniorCitizen', 'tenure', 'ChurnBinary')

rates.iplot(kind='line', xTitle='Tenure (Months)', yTitle='Retention Rate', title='Retention Rates by Tenure and Senior Citizen')
