# Employee Churn

We have employee data from several companies, covering everyone who joined between 2011/01/24 and 2015/12/13. For each employee, we know whether they were still at the company on 2015/12/13 or had quit. We also have general information about each employee, such as average salary during their tenure, department, and years of experience.

The goal is to predict employee retention and understand its main drivers. Specifically, you should:


1. Assume, for each company, that the headcount starts from zero on 2011/01/23. Estimate employee headcount, for each company on each day, from 2011/01/24 to 2015/12/13. That is, if by 2012/03/02 2000 people have joined company 1 and 1000 of them have already quit, then company headcount on 2012/03/02 for company 1 would be 1000. You should create a table with 3 columns: day, employee_headcount, company_id.


2. What are the main factors that drive employee churn? Do they make sense? Explain your findings.


3. If you could add to this data set just one variable that could help explain employee churn, what would that be?
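
For the first task, the headcount rule can be written as a single formula. The notation below is introduced here for illustration and is not part of the original brief: $E_c$ is the set of employees of company $c$, and $\text{join}_e$, $\text{quit}_e$ are the join and quit dates of employee $e$. It assumes that someone who quits on day $d$ is no longer counted on day $d$, which the brief does not state explicitly.

```latex
H_c(d) \;=\; \bigl|\{\, e \in E_c : \text{join}_e \le d \,\}\bigr|
       \;-\; \bigl|\{\, e \in E_c : \text{quit}_e \le d \,\}\bigr|
```

With the numbers from the example in the brief, $H_1(\text{2012/03/02}) = 2000 - 1000 = 1000$.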


### Data Checking

In [286]:
import pandas as pd
import numpy as np
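# pandas is imported as pd, so use that alias throughout
pd.set_option('display.max_columns', 10)
pd.set_option('display.width', 350)

# read the data (from Google Drive); forward slash avoids the invalid "\e" escape
data = pd.read_csv("./employee_retention.csv")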
  
print(data.head())

   employee_id  company_id              dept  seniority    salary   join_date   quit_date
0      13021.0           7  customer_service         28   89000.0  2014-03-24  2015-10-30
1     825355.0           7         marketing         20  183000.0  2013-04-29  2014-04-04
2     927315.0           4         marketing         14  101000.0  2014-10-13         NaN
3     662910.0           7  customer_service         20  115000.0  2012-05-14  2013-06-07
4     256971.0           2      data_science         23  276000.0  2011-10-17  2014-08-22


In [287]:
# explore data
data.shape

(24702, 7)

In [26]:
data['company_id'].unique()
# only 12 companies in total - company 1 to company 12

array([ 7,  4,  2,  9,  1,  6, 10,  5,  3,  8, 11, 12], dtype=int64)
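
Before building the headcount table, a few more sanity checks may be worth running. The sketch below is only an illustration: `summarize_checks` is a hypothetical helper, and the `toy` rows are made up so the block runs without the CSV. It assumes the same column names as the `data.head()` output above.

```python
import pandas as pd

def summarize_checks(df: pd.DataFrame) -> dict:
    """Return a few basic sanity checks for the employee table (hypothetical helper)."""
    join = pd.to_datetime(df["join_date"])
    quit_ = pd.to_datetime(df["quit_date"])  # missing quit dates become NaT
    return {
        "n_rows": len(df),
        "n_duplicate_ids": int(df["employee_id"].duplicated().sum()),
        "n_still_employed": int(quit_.isna().sum()),
        # comparisons against NaT are False, so only real quit dates are counted
        "n_quit_before_join": int((quit_ < join).sum()),
    }

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "company_id": [1, 1, 2],
    "join_date": ["2011-01-24", "2011-01-25", "2011-02-01"],
    "quit_date": ["2011-03-01", None, None],
})
checks = summarize_checks(toy)
print(checks)
```

On the real data this would be called as `summarize_checks(data)`; any non-zero `n_duplicate_ids` or `n_quit_before_join` would need a closer look before counting heads.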

### Question 1: Create table with employee headcount by day for all companies

In [326]:
# create the dates from 2011/01/24 to 2015/12/13
dates = pd.date_range('2011-01-24', '2015-12-13')
len(dates)

1785

In [457]:
# index 0 holds an empty placeholder list, so dates_headct_all[i] is the table for company i
dates_headct_all = [[]]

# pd.merge needs a DataFrame on the left, so wrap the DatetimeIndex in a one-column frame
dates_df = pd.DataFrame({'dates': dates})

for i in np.arange(1, 13):

    data_cpn = data[data['company_id']==i]

    join_dates = data_cpn['join_date']
    quit_dates = data_cpn['quit_date']

    # number of employees joining on each date
    join_date_ct = pd.DataFrame({'join_date': join_dates.value_counts().index,
                'join_date_ct': join_dates.value_counts().values})
    join_date_ct['join_date'] = pd.to_datetime(join_date_ct['join_date'])

    # number of employees quitting on each date (value_counts skips missing quit dates)
    quit_date_ct = pd.DataFrame({'quit_date': quit_dates.value_counts().index,
                'quit_date_ct': quit_dates.value_counts().values})
    quit_date_ct['quit_date'] = pd.to_datetime(quit_date_ct['quit_date'])

    # left-join the daily counts onto the full calendar so every day has a row
    dates_headct = pd.merge(dates_df, join_date_ct, how = 'left', left_on = 'dates', right_on = 'join_date')
    dates_headct = pd.merge(dates_headct, quit_date_ct, how = 'left', left_on = 'dates', right_on = 'quit_date')

    # fill NaNs with 0 in join_date_ct and quit_date_ct to work out the cumulative counts
    dates_headct['join_date_ct'] = dates_headct['join_date_ct'].fillna(0)
    dates_headct['quit_date_ct'] = dates_headct['quit_date_ct'].fillna(0)

    dates_headct['join_cum_count'] = dates_headct['join_date_ct'].cumsum()
    dates_headct['quit_cum_count'] = dates_headct['quit_date_ct'].cumsum()

    # headcount = everyone who has joined so far minus everyone who has quit so far
    dates_headct['headcount'] = dates_headct['join_cum_count'] - dates_headct['quit_cum_count']

    dates_headct['company_id'] = i

    dates_headct_all.append(dates_headct)

In [469]:
# stack the 12 per-company tables, skipping the empty placeholder at index 0
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
comp_headcnt_by_date = pd.concat(dates_headct_all[1:])
comp_headcnt_by_date[21:31]

         dates   join_date  join_date_ct  quit_date  quit_date_ct  join_cum_count  quit_cum_count  headcount  company_id
21  2011-02-14  2011-02-14          25.0        NaT           0.0           129.0             0.0      129.0           1
22  2011-02-15  2011-02-15           1.0        NaT           0.0           130.0             0.0      130.0           1
23  2011-02-16  2011-02-16           2.0        NaT           0.0           132.0             0.0      132.0           1
24  2011-02-17         NaT           0.0        NaT           0.0           132.0             0.0      132.0           1
25  2011-02-18         NaT           0.0        NaT           0.0           132.0             0.0      132.0           1
26  2011-02-19         NaT           0.0        NaT           0.0           132.0             0.0      132.0           1
27  2011-02-20         NaT           0.0        NaT           0.0           132.0             0.0      132.0           1
28  2011-02-21         NaT           0.0        NaT           0.0           132.0             0.0      132.0           1
29  2011-02-22  2011-02-22          24.0        NaT           0.0           156.0             0.0      156.0           1
30  2011-02-23  2011-02-23           3.0        NaT           0.0           159.0             0.0      159.0           1


In [471]:
comp_headcnt_by_date[['dates', 'company_id', 'headcount']].head(10)

        dates  company_id  headcount
0  2011-01-24           1       25.0
1  2011-01-25           1       27.0
2  2011-01-26           1       29.0
3  2011-01-27           1       29.0
4  2011-01-28           1       29.0
5  2011-01-29           1       29.0
6  2011-01-30           1       29.0
7  2011-01-31           1       59.0
8  2011-02-01           1       66.0
9  2011-02-02           1       67.0
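
The per-company loop above can also be written without a loop, and can return exactly the three columns the task asks for (`day`, `employee_headcount`, `company_id`). The following is a minimal sketch, not the notebook's original method: `headcount_by_day` is a hypothetical helper, the `toy` rows are made up, and it assumes every join date falls inside the requested date range (as the brief states for the real data).

```python
import pandas as pd

def headcount_by_day(df: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Daily headcount per company as a long table (hypothetical helper)."""
    days = pd.date_range(start, end)
    companies = sorted(df["company_id"].unique())
    join = pd.to_datetime(df["join_date"])
    quit_ = pd.to_datetime(df["quit_date"])  # missing quit dates become NaT

    def daily_counts(event_dates: pd.Series) -> pd.DataFrame:
        # Rows are days, columns are companies, values are events on that day.
        counts = (df.assign(day=event_dates)
                    .dropna(subset=["day"])
                    .groupby(["day", "company_id"]).size()
                    .unstack(fill_value=0))
        # Add days and companies with no events as zeros.
        return counts.reindex(index=days, columns=companies, fill_value=0)

    # Headcount = cumulative joins minus cumulative quits.
    wide = daily_counts(join).cumsum() - daily_counts(quit_).cumsum()
    wide = wide.rename_axis(index="day", columns="company_id")

    # Reshape from wide (one column per company) to long (one row per day and company).
    long = wide.reset_index().melt(id_vars="day", var_name="company_id",
                                   value_name="employee_headcount")
    long["company_id"] = long["company_id"].astype(df["company_id"].dtype)
    long = long.sort_values(["company_id", "day"]).reset_index(drop=True)
    return long[["day", "employee_headcount", "company_id"]]

# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "company_id": [1, 1, 2],
    "join_date": ["2011-01-24", "2011-01-25", "2011-01-25"],
    "quit_date": ["2011-01-26", None, None],
})
table = headcount_by_day(toy, "2011-01-24", "2011-01-27")
print(table)
```

On the real data the call would be `headcount_by_day(data, '2011-01-24', '2015-12-13')`. As in the loop version, an employee who quits on a given day is no longer counted on that day.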
