# Predicting Turnover with Readily Available Staffing Data
## Intro
Employee turnover is one of the most heavily measured and studied areas in Human Resources.  Years ago, I attended a conference where a data scientist took a big data approach and developed an algorithm to predict how likely employees were to still be employed over a twelve-month period.  He used objective factors that were readily available in most HR systems, drawing on hundreds of companies and millions of observations.  I used his algorithm and replicated the findings at my company with similar results.

In this study, I wanted to go small scale.  Are there any factors available in common Staffing and headcount reports that are good predictors of turnover?  I pulled two headcount reports: one on the first day of the year and one on the last day of the year.  I merged the two datasets to see who was still with the company at the end of the reporting period and who left sometime between the first and last day of the year.  This data is also commonly used to calculate Retention metrics (not to be confused with Turnover metrics, which are entirely different).
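The merge was done in Excel for this study, but the same step might look roughly like this in R.  This is a minimal sketch on toy data: the `emp_id` and `mgr_id` columns are hypothetical, and coding manager churn as 0 for employees with no end-of-year record is an assumption, since that detail is not stated here.

```r
# Toy stand-ins for the two headcount reports.
# emp_id and mgr_id are hypothetical column names.
time_one <- data.frame(emp_id = c(101, 102, 103, 104),
                       mgr_id = c("A", "A", "B", "B"),
                       stringsAsFactors = FALSE)
time_two <- data.frame(emp_id = c(101, 103, 104),
                       mgr_id = c("A", "C", "B"),
                       stringsAsFactors = FALSE)

# Left join: keep everyone active at Time One and attach
# their Time Two record when one exists.
merged <- merge(time_one, time_two, by = "emp_id", all.x = TRUE,
                suffixes = c("_t1", "_t2"))

# No Time Two record means the employee left during the period.
merged$terminated <- as.integer(is.na(merged$mgr_id_t2))

# Assumption: churn is 1 only when both managers are known and differ.
merged$mgrChurn <- as.integer(!is.na(merged$mgr_id_t2) &
                                merged$mgr_id_t1 != merged$mgr_id_t2)

# Retention is the share of Time One employees still active at Time Two.
retention_rate <- 1 - mean(merged$terminated)
print(merged)
print(retention_rate)
```

The `all.x = TRUE` argument is what keeps departed employees in the result; an inner join would silently drop exactly the people the study is trying to predict.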

For this study, I chose logistic regression because my outcome variable (Active or Termed at end of period) is binary.  My independent variables, or factors, are a mix of binary and continuous variables.



In [33]:
 install.packages("AER")

Installing package into 'C:/Users/David/Documents/R/win-library/3.4'
(as 'lib' is unspecified)


package 'AER' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\David\AppData\Local\Temp\RtmpYpY1Yv\downloaded_packages


## Data
Two reports were run to build the data for this analysis: a **Time One** report and a **Time Two** report.  The Time One data was pulled at the beginning of the year (e.g., January 1).  The Time Two data was pulled at the end of the year (e.g., December 31).  The two datasets were merged, and all data wrangling was handled in MS Excel (*I know*) before being exported to a .csv file for import.  

The factors used in this analysis are listed below.  All of them are readily available as is or easily computed from common HR or Staffing reports.  

### Factors
- Terminated:  0 = Still active at the end of the period, 1 = Not active at the end of the period
- MgrChurn:  0 = Same manager at the end of the period as the employee had at the beginning of the period, 1 = Different manager
- Gender:  0 = Male, 1 = Female
- Minority:  0 = White, 1 = Minority (includes a few not-specified)
- FLSA Type:  0 = Salary, 1 = Hourly
- Seniority:  How long the employee had been with the company at the beginning of the period (in years)
- Age: The age of the employee at the beginning of the period.
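Seniority and Age are the two factors that usually have to be computed rather than read straight off a report.  A minimal sketch, assuming the report carries hire and birth dates: the `hire_date` and `birth_date` column names, the example period start date, and the 365.25-day year are all assumptions made for illustration.

```r
# Hypothetical start of the reporting period.
period_start <- as.Date("2017-01-01")

# Toy roster; hire_date and birth_date are assumed column names.
roster <- data.frame(
  hire_date  = as.Date(c("2010-01-01", "2016-07-01")),
  birth_date = as.Date(c("1962-02-15", "1990-01-01"))
)

# Elapsed time in years, using 365.25 days per year to absorb leap days.
years_between <- function(from, to) {
  round(as.numeric(to - from) / 365.25, 2)
}

roster$seniority <- years_between(roster$hire_date, period_start)
roster$age       <- years_between(roster$birth_date, period_start)
print(roster)
```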

In [40]:
mydata <- read.csv("2017_logistic_turnover.csv")
head(mydata)

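| terminated | mgrChurn | gender | minority | flsaType | seniority | age |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 6.99 | 54.9 |
| 0 | 1 | 0 | 1 | 0 | 2.37 | 58.5 |
| 0 | 1 | 0 | 1 | 0 | 9.18 | 47.6 |
| 0 | 1 | 0 | 1 | 0 | 0.49 | 57.2 |
| 0 | 0 | 1 | 1 | 0 | 2.97 | 37.6 |
| 0 | 0 | 1 | 1 | 0 | 11.84 | 61.3 |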


In [35]:
summary(mydata)

   terminated        mgrChurn          gender          minority     
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :1.0000   Median :0.0000  
 Mean   :0.2527   Mean   :0.3381   Mean   :0.5558   Mean   :0.4964  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    flsaType       seniority           age       
 Min.   :0.000   Min.   :-0.910   Min.   :18.80  
 1st Qu.:0.000   1st Qu.: 0.530   1st Qu.:33.00  
 Median :1.000   Median : 1.450   Median :40.70  
 Mean   :0.706   Mean   : 2.371   Mean   :41.56  
 3rd Qu.:1.000   3rd Qu.: 2.990   3rd Qu.:49.50  
 Max.   :1.000   Max.   :24.430   Max.   :77.40  

## Summary Statistics
The dataset used in this analysis contains 3,197 employees that were active at the beginning of the reporting period.  Twenty-five percent of the employees were not present at the end of the period, resulting in a 75% retention rate.

Below is a quick summary of the percentages for the factors used in this analysis:
- Manager Churn:  34% of employees had a different manager at the end of the period
- Gender:  56% female
- Minority:  50% one or more minority status
- FlsaType:  71% non-exempt or hourly
- Seniority:  Average tenure is about 2.4 years
- Age:  Average age is almost 42 years

Knowing this dataset, these summary statistics look right, which gives me confidence that my data wrangling in MS Excel didn't introduce many errors.
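Because every binary factor is coded 0/1, its mean is simply the share of employees coded 1, so the percentages above can be read off column means.  A small sketch using only the six rows printed by `head(mydata)` earlier (so the numbers it produces describe those six rows, not the full dataset):

```r
# The six rows shown by head(mydata), re-entered by hand.
sample_rows <- data.frame(
  terminated = c(0, 0, 0, 0, 0, 0),
  mgrChurn   = c(1, 1, 1, 1, 0, 0),
  gender     = c(1, 0, 0, 0, 1, 1),
  minority   = c(1, 1, 1, 1, 1, 1),
  flsaType   = c(0, 0, 0, 0, 0, 0),
  seniority  = c(6.99, 2.37, 9.18, 0.49, 2.97, 11.84),
  age        = c(54.9, 58.5, 47.6, 57.2, 37.6, 61.3)
)

binary_cols <- c("terminated", "mgrChurn", "gender", "minority", "flsaType")

# Mean of a 0/1 column is the proportion of 1s; scale to percent.
pct <- round(100 * colMeans(sample_rows[, binary_cols]), 1)

# Retention is the complement of the termination share.
retention_pct <- 100 - pct[["terminated"]]

# Plain averages for the continuous factors.
avg <- colMeans(sample_rows[, c("seniority", "age")])

print(pct)
print(retention_pct)
print(avg)
```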

## Analysis
Given the structure of the data, logistic regression will be used to identify which of the available factors, if any, are significant predictors of termination.
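For reference, the full model fitted in the next cell can be written out as follows.  This is the standard logistic regression form with the factor names from this dataset substituted in; the symbol $p$ for the probability that `terminated` equals 1 is notation introduced here.

```latex
\log\frac{p}{1-p}
  = \beta_0
  + \beta_1\,\mathrm{mgrChurn}
  + \beta_2\,\mathrm{gender}
  + \beta_3\,\mathrm{minority}
  + \beta_4\,\mathrm{flsaType}
  + \beta_5\,\mathrm{seniority}
  + \beta_6\,\mathrm{age},
\qquad p = \Pr(\mathrm{terminated} = 1)
```

Each coefficient is a change in the log-odds of termination, so $e^{\beta_k}$ is the odds ratio for a one-unit increase in factor $k$ with the other factors held fixed.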



In [41]:
# Build regression model 
fit.full <- glm(terminated ~ mgrChurn + gender + minority + flsaType + seniority + age, data = mydata, family=binomial)
summary(fit.full)


Call:
glm(formula = terminated ~ mgrChurn + gender + minority + flsaType + 
    seniority + age, family = binomial, data = mydata)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.155  -0.830  -0.677   1.285   2.954  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.987259   0.203656  -4.848 1.25e-06 ***
mgrChurn    -0.379165   0.091063  -4.164 3.13e-05 ***
gender      -0.089616   0.085808  -1.044   0.2963    
minority     0.021180   0.086214   0.246   0.8059    
flsaType     0.242110   0.101980   2.374   0.0176 *  
seniority   -0.263916   0.026545  -9.942  < 2e-16 ***
age          0.009446   0.003924   2.407   0.0161 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3614.7  on 3196  degrees of freedom
Residual deviance: 3427.8  on 3190  degrees of freedom
AIC: 3441.8

Number of Fisher Scoring iterations: 5


The coefficients with significant p-values in the full model are Manager Churn, Seniority, FLSA Type, and Age.  A reduced model including only those factors is run next.

In [24]:
fit.reduced <- glm(terminated ~ mgrChurn + seniority + flsaType + age, data = mydata, family=binomial)
summary(fit.reduced)


Call:
glm(formula = terminated ~ mgrChurn + seniority + flsaType + 
    age, family = binomial, data = mydata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1737  -0.8312  -0.6824   1.3010   2.9395  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.009288   0.195113  -5.173 2.31e-07 ***
mgrChurn    -0.379466   0.091012  -4.169 3.05e-05 ***
seniority   -0.263938   0.026495  -9.962  < 2e-16 ***
flsaType     0.230274   0.100349   2.295   0.0217 *  
age          0.009251   0.003883   2.382   0.0172 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3614.7  on 3196  degrees of freedom
Residual deviance: 3428.9  on 3192  degrees of freedom
AIC: 3438.9

Number of Fisher Scoring iterations: 5


The following code compares the full model to the reduced model.  A non-significant chi-square value indicates that the reduced model fits just as well as the full model.

In [25]:
anova(fit.reduced, fit.full, test = "Chisq")

| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 3192 | 3428.926 | | | |
| 3190 | 3427.828 | 2.0 | 1.098016 | 0.5775225 |
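The chi-square test reported by `anova()` is a likelihood ratio test, and it can be reproduced by hand from the two residual deviances.  A short sketch using the values printed above (the variable names are introduced here for illustration):

```r
# Residual deviances and degrees of freedom from the anova() output.
dev_reduced <- 3428.926
df_reduced  <- 3192
dev_full    <- 3427.828
df_full     <- 3190

# Test statistic: how much deviance the two dropped factors explained.
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full

# Upper-tail chi-square probability of a statistic at least this large.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p = p_value))
```

With a p-value well above 0.05, dropping gender and minority does not significantly worsen the fit, which supports using the reduced model.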


In [26]:
# Extract the regression coefficients (log-odds scale)
coef(fit.reduced)

In [27]:
# Exponentiate the log-odds to make them easier to interpret
exp(coef(fit.reduced))
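The same exponentiation can be reproduced without the fitted model object by re-entering the estimates from the reduced-model summary.  A sketch under that assumption (the coefficients are copied by hand, so they carry only the six decimals printed there):

```r
# Estimates copied from summary(fit.reduced).
reduced_coefs <- c("(Intercept)" = -1.009288,
                   mgrChurn      = -0.379466,
                   seniority     = -0.263938,
                   flsaType      =  0.230274,
                   age           =  0.009251)

# Odds ratio for a one-unit increase in each factor.
odds_ratios <- exp(reduced_coefs)

# Percent change in the odds of termination per one-unit increase.
pct_change <- 100 * (odds_ratios - 1)

# A ten-year age gap compounds the per-year odds ratio.
age_10yr <- exp(10 * reduced_coefs[["age"]])

print(round(odds_ratios, 3))
print(round(pct_change, 1))
print(round(age_10yr, 3))
```

An odds ratio below 1 means the factor lowers the odds of termination, and one above 1 raises them.  Note that the intercept's exponentiated value is a baseline odds, not an odds ratio.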

## Results


Although I'm not an expert at logistic regression (and Robert Kabacoff's book really helped me out), the exponentiated log-odds make sense to me.  Manager Churn has the largest coefficient in the equation.  At first, I thought I had coded this one incorrectly.  I would have thought that employees would be more likely to leave a company if they had a lot of different managers during the period.  However, that is not what is being measured here.  This factor only captures whether the employee ended the period with a different manager than the one they started with.  It could be a proxy for internal movement such as a promotion or transfer.  That is, people who are doing well are more likely to transfer and, as a result, more likely to stay with the company.  Seniority, which has the largest z-value in the model, turns out to be an important factor too.  The longer someone has been employed with the company, the more likely they are to stay.

FLSA status, or whether an employee is hourly or salaried, is also an important factor.  Hourly employees are more likely to leave during the period than salaried employees.  Given that salaried jobs tend to pay more and carry more responsibility, the direction of this factor makes sense.  Finally, Age does seem to contribute to the model, though in my opinion only slightly.  Older workers are less likely to still be with the company at the end of the period, but with an odds ratio of 1.009 per year of age, only slightly.  It may be statistically significant, but it is probably not meaningful, at least not for this dataset.
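Odds ratios can be hard to picture, so it may help to turn the reduced model into predicted probabilities for a few example employees.  This is a sketch only: the coefficients are re-entered from the reduced-model summary, the `predict_termination` helper is a name introduced here, and the three employee profiles are invented for illustration.

```r
# Estimates copied from summary(fit.reduced).
b <- c(intercept = -1.009288,
       mgrChurn  = -0.379466,
       seniority = -0.263938,
       flsaType  =  0.230274,
       age       =  0.009251)

# Hypothetical helper: linear predictor passed through the inverse logit.
predict_termination <- function(mgrChurn, seniority, flsaType, age) {
  log_odds <- b[["intercept"]] +
    b[["mgrChurn"]]  * mgrChurn +
    b[["seniority"]] * seniority +
    b[["flsaType"]]  * flsaType +
    b[["age"]]       * age
  plogis(log_odds)  # converts log-odds to a probability
}

# Invented profiles: hourly, age 40, varying manager change and tenure.
p_same_mgr <- predict_termination(mgrChurn = 0, seniority = 1,  flsaType = 1, age = 40)
p_new_mgr  <- predict_termination(mgrChurn = 1, seniority = 1,  flsaType = 1, age = 40)
p_tenured  <- predict_termination(mgrChurn = 0, seniority = 10, flsaType = 1, age = 40)

print(round(c(same_mgr = p_same_mgr, new_mgr = p_new_mgr, tenured = p_tenured), 3))
```

With the fitted model object available, `predict(fit.reduced, newdata = ..., type = "response")` would give the same kind of probabilities directly.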

So there you have it.  You can gain some insight into who will still be employed twelve months from now just from some basic, everyday Staffing data.