# Survival Analysis Lab

Complete the following exercises to solidify your knowledge of survival analysis.

In [1]:
import pandas as pd
import cufflinks as cf  # adds the .iplot() method to pandas objects
from lifelines import KaplanMeierFitter

cf.go_offline()

In [2]:
data = pd.read_csv('../data/attrition.csv')
data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,0,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,0,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,0,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


In [3]:
data.columns

Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')

## 1. Generate and plot a survival function that shows how employee retention rates vary by gender and employee age.

*Tip: If your lines have gaps in them, you can fill them in by forward-filling and back-filling the curves (`fillna(method='ffill')` and `fillna(method='bfill')`, or `ffill()` and `bfill()` in recent pandas) and then averaging the two results. We have provided a revised survival function below that you can use for the exercises in this lab.*
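As a rough illustration of the tip above, the sketch below applies the forward-fill/back-fill average to two made-up curves. The series `a` and `b` and their values are hypothetical and are not taken from the attrition data.

```python
import pandas as pd

# Two hypothetical survival curves observed at different time points.
a = pd.Series([1.0, 0.8, 0.6], index=[0, 2, 4], name='A')
b = pd.Series([1.0, 0.9, 0.7], index=[0, 1, 3], name='B')

# Placing them side by side leaves NaN wherever a group has no observation.
curves = pd.concat([a, b], axis=1).sort_index()

# Average of the forward-filled and back-filled frames.
filled = (curves.ffill() + curves.bfill()) / 2
print(filled)
```

Interior gaps end up halfway between their neighbours. A gap at the end of a column stays `NaN`, because back-filling has no later value to copy, so that group's line simply ends earlier on a plot.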

In [4]:
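def survival(data, group_field, time_field, event_field):
    """Fit one Kaplan-Meier curve per group and return them side by side."""
    kmf = KaplanMeierFitter()
    results = []

    for i in data[group_field].unique():
        group = data[data[group_field] == i]
        T = group[time_field]   # durations
        E = group[event_field]  # 1 = event observed (attrition), 0 = censored
        kmf.fit(T, E, label=str(i))
        results.append(kmf.survival_function_)

    # Groups are observed at different times, so the combined frame has gaps.
    curves = pd.concat(results, axis=1).sort_index()
    front_fill = curves.ffill()
    back_fill = curves.bfill()
    # Average the two fills to bridge interior gaps; trailing gaps stay NaN.
    smoothed = (front_fill + back_fill) / 2
    return smoothed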
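To make the quantity being plotted less opaque, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit estimate. It is not how lifelines is implemented, and the helper name `kaplan_meier` and the toy inputs are made up for illustration. At each time with at least one event, the running survival probability is multiplied by the fraction of at-risk subjects who did not experience the event.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] for each time with an event."""
    curve = []
    surv = 1.0
    for t in sorted(set(durations)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events (event flag == 1) that happen exactly at time t.
        observed = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        if observed:
            surv *= 1 - observed / at_risk
            curve.append((t, surv))
    return curve


# Hypothetical data: 1 = left the company, 0 = still employed (censored).
durations = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(durations, events)
print(km_curve)
```

Censored subjects count toward the at-risk total until their recorded time but never trigger a drop in the curve, which is why the event flag matters as much as the duration.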

In [5]:
rates = survival(data, 'Gender', 'Age', 'Attrition')[1:]
# Drop the first row (time 0), which is meaningless for age, so the curve starts at age 18.

rates.iplot(kind='line', xTitle='Age', yTitle='Employee Retention Rates', 
            title='Employee Retention Rates or Attrition by Gender and Age')

## 2. Compare the plot above with one that plots employee retention rates by gender over the number of years the employee has been working for the company.

In [6]:
rates = survival(data, 'Gender','YearsAtCompany', 'Attrition') 

rates.iplot(kind='line', xTitle='YearsAtCompany', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by Gender and Years At Company')

# The Male line stops abruptly because the dataset has no more data for that group;
# it is not an error.


## 3. Let's look at retention rate by gender from a third perspective - the number of years since the employee's last promotion. Generate and plot a survival curve showing this.

In [7]:
rates = survival(data, 'Gender','YearsSinceLastPromotion', 'Attrition')

rates.iplot(kind='line', xTitle='YearsSinceLastPromotion', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by Gender and Years Since Last Promotion')

## 4. Let's switch to looking at retention rates from another demographic perspective: marital status. Generate and plot survival curves for the different marital statuses by number of years at the company.

In [8]:
rates = survival(data, 'MaritalStatus','YearsAtCompany', 'Attrition')

rates.iplot(kind='line', xTitle='Years At Company', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by Marital Status and Years At Company')

# The Divorced line stops abruptly because the dataset has no more data for that group;
# it is not an error.


## 5. Let's also look at the marital status curves by employee age. Generate and plot the survival curves showing retention rates by marital status and age.

In [9]:
rates = survival(data, 'MaritalStatus', 'Age', 'Attrition')[1:]
# Drop the first row, since nobody between ages 0 and 18 is married or divorced.

rates.iplot(kind='line', xTitle='Age', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by Marital Status and Age')


## 6. Now that we have looked at the retention rates by gender and marital status individually, let's look at them together. 

Create a new field in the data set that concatenates marital status and gender, and then generate and plot a survival curve that shows the retention by this new field over the age of the employee.

In [10]:
data['Gender_MaritalStatus'] = data['Gender'] + '-' + data['MaritalStatus']

display(data.head())

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Gender_MaritalStatus
0,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,80,0,8,0,1,6,4,0,5,Female-Single
1,49,0,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,80,1,10,3,3,10,7,1,7,Male-Married
2,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,80,0,7,3,3,0,0,0,0,Male-Single
3,33,0,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,80,0,8,3,3,8,7,3,0,Female-Married
4,27,0,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,80,1,6,3,3,2,2,2,2,Male-Married


In [11]:
rates = survival(data, 'Gender_MaritalStatus','Age', 'Attrition')[1:]

rates.iplot(kind='line', xTitle='Age', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by Gender - Marital Status and Age')

## 6. Let's find out how job satisfaction affects retention rates. Generate and plot survival curves for each level of job satisfaction by number of years at the company.

In [12]:
rates = survival(data, 'JobSatisfaction','YearsAtCompany', 'Attrition')

rates.iplot(kind='line', xTitle='Years At Company', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by Job Satisfaction and Years at company')

## 7. Let's investigate whether the department the employee works in has an impact on how long they stay with the company. Generate and plot survival curves showing retention by department and years the employee has worked at the company.

In [13]:
rates = survival(data, 'Department','YearsAtCompany', 'Attrition')

rates.iplot(kind='line', xTitle='Years At Company', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by Department and Years at Company')

## 8. From the previous example, it looks like the sales department has the highest attrition. Let's drill down on this and look at what the survival curves for specific job roles within that department look like.

Filter the data set for just the sales department and then generate and plot survival curves by job role and the number of years at the company.

In [14]:
data_sales = data.loc[data['Department']=="Sales"]
data_sales.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Gender_MaritalStatus
0,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,80,0,8,0,1,6,4,0,5,Female-Single
18,53,0,Travel_Rarely,1219,Sales,2,4,Life Sciences,1,23,...,80,0,31,3,3,25,8,3,7,Female-Married
21,36,1,Travel_Rarely,1218,Sales,9,4,Life Sciences,1,27,...,80,0,10,4,3,5,3,0,3,Male-Single
27,42,0,Travel_Rarely,691,Sales,8,4,Marketing,1,35,...,80,1,10,2,3,9,7,4,2,Male-Married
29,46,0,Travel_Rarely,705,Sales,2,4,Marketing,1,38,...,80,0,22,2,2,2,2,2,1,Female-Single


In [15]:
rates = survival(data_sales, 'JobRole','YearsAtCompany', 'Attrition')

rates.iplot(kind='line', xTitle='Years At Company', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by Role in Sales department and Years at company')

## 9. Let's examine how compensation affects attrition.

- Use the `pd.qcut` method to bin the HourlyRate field into 5 different pay grade categories (Very Low, Low, Moderate, High, and Very High).
- Generate and plot survival curves showing employee retention by pay grade and age.
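As a small sketch of what `pd.qcut` does in the first step, the example below bins ten made-up hourly rates into five equal-sized quantile groups. The values in `hourly` are hypothetical and are not drawn from the attrition data.

```python
import pandas as pd

# Ten hypothetical hourly rates.
hourly = pd.Series([30, 35, 42, 48, 55, 61, 70, 78, 85, 99])
labels = ['Very Low', 'Low', 'Moderate', 'High', 'Very High']

# qcut chooses bin edges at the quantiles, so each bin gets about the same count.
grades = pd.qcut(hourly, 5, labels=labels)
print(grades.value_counts(sort=False))
```

Because the edges come from the data's quantiles rather than fixed cut points, the labels describe relative position in the pay distribution, not absolute pay levels.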

In [16]:
data['HourlyRate_Bins'] = pd.qcut(data['HourlyRate'], 5, labels=['Very Low', 'Low', 'Moderate', 'High','Very High'])
data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Gender_MaritalStatus,HourlyRate_Bins
0,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,0,8,0,1,6,4,0,5,Female-Single,Very High
1,49,0,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,1,10,3,3,10,7,1,7,Male-Married,Moderate
2,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,0,7,3,3,0,0,0,0,Male-Single,Very High
3,33,0,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,0,8,3,3,8,7,3,0,Female-Married,Low
4,27,0,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,1,6,3,3,2,2,2,2,Male-Married,Very Low


In [17]:
rates = survival(data, 'HourlyRate_Bins','Age', 'Attrition')[1:]

rates.iplot(kind='line', xTitle='Age', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by pay grade and Age')

## 10. Finally, let's take a look at how the demands of the job impact employee attrition.

- Create a new field whose values are 'Overtime' or 'Regular Hours' depending on whether there is a Yes or a No in the OverTime field.
- Create a new field that concatenates that field with the BusinessTravel field.
- Generate and plot survival curves showing employee retention based on these conditions and employee age.

In [18]:
# Relabel OverTime: 'Yes' -> 'Overtime', 'No' -> 'Regular Hours'
data['Overtime_or_regularHours'] = data['OverTime'].replace({'Yes': 'Overtime', 'No': 'Regular Hours'})

# Combine the working-hours label with the travel frequency
data['Overtime_or_Regular_and_BusinessTravel'] = data['Overtime_or_regularHours'] + '-' + data['BusinessTravel']

data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager,Gender_MaritalStatus,HourlyRate_Bins,Overtime_or_regularHours,Overtime_or_Regular_and_BusinessTravel
0,41,1,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,0,1,6,4,0,5,Female-Single,Very High,Overtime,Overtime-Travel_Rarely
1,49,0,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,3,3,10,7,1,7,Male-Married,Moderate,Regular Hours,Regular Hours-Travel_Frequently
2,37,1,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,3,3,0,0,0,0,Male-Single,Very High,Overtime,Overtime-Travel_Rarely
3,33,0,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,3,8,7,3,0,Female-Married,Low,Overtime,Overtime-Travel_Frequently
4,27,0,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,3,3,2,2,2,2,Male-Married,Very Low,Regular Hours,Regular Hours-Travel_Rarely


In [19]:
rates = survival(data, 'Overtime_or_Regular_and_BusinessTravel', 'Age', 'Attrition')[1:]

rates.iplot(kind='line', xTitle='Age', yTitle='Employee Retention Rates',
            title='Employee Retention Rates by Business Travel and Overtime/Regular Hours and Age')