# Proposal: Predicting Why Employees Leave

## Domain Background
According to the Bureau of Labor Statistics, the median number of years that wage and salary workers had been with their current employer was 4.2 years in January 2016. While this number varies from industry to industry, the story of an employee who sticks with one company for an entire working life seems rather antiquated.
At the same time, "employee turnover has been identified as a key issue for organizations because of its adverse impact on work place productivity and long term growth strategies".[1] One of the key issues with a high employee turnover rate, beyond its cultural and sociological effects, is the cost associated with it. Research shows that the replacement cost for an hourly worker can be as high as 50% of her annual salary. That number increases with the skill set of the worker, up to 200% for senior-level workers and up to 400% for executive-level positions.[2] The trend toward shorter tenure, combined with high employee turnover rates, can therefore be costly.

It is therefore increasingly important for employers to acquire the tools needed to understand where their workforce stands. Additional insights from employer reports, scorecards and general statistical information can help companies predict the longevity of jobs. In this research we try to predict the likelihood of an employee quitting based on available information, and we apply supervised machine learning methods to gain actionable insights into how to prevent high employee turnover.

## Problem Statement
The problem to be solved is detecting the key elements of employee tenure and predicting whether an employee might quit her job. The setup of this research can be seen as a classification problem: based on a set of features, our solution should be able to determine whether an employee quits or stays with a company. The core path to solving this problem is a supervised learning approach that tests the relationship between our independent variables and our dependent variable (did an employee leave or stay). The tasks involved are the following:

1. Download and preprocess the Kaggle data set about employment (more detailed information about the features is given below).
2. Use statistical methods such as descriptive analysis, regression and/or correlation to lay groundwork.
3. Train a clustering algorithm to group employees into different segments.
4. Train a classifier that determines whether an employee has left.
5. Compare our results against the benchmark model.

A desirable solution can be quantified by the correct detection of potential job quitters and by the number of intervention opportunities it creates.

This problem affects companies of all sizes. Although statistics show that certain industries have a higher tenure, relatively speaking the tendency to switch jobs has increased over time. The boundaries of this problem are, of course, the underlying mechanisms of every individual company. These effects cannot be generalized in terms of 'how to prevent' but rather in terms of 'how to predict'. Company culture, direct reports and individual needs might turn out to be far more complicated than a list of features. However, if we solve this problem, it opens a path to a more successful and sustainable company culture.
It should be stated that a mechanism such as this can only work as a supplement to human interaction and empathy.

## Datasets and Inputs
Measuring employee satisfaction is a tough and highly interdependent task. Many different dimensions are in play, and turning them into a quantifiable format (let alone machine-readable information) can pose a challenge. Information from employee reviews, demographics, balanced scorecards and key performance indicators can offer a first gateway to understanding an employee's desire to leave the company.

In a dataset published on [Kaggle](https://www.kaggle.com/ludobenistant/hr-analytics) we are offered information on current and former employees plus key features of their employment status. The dataset consists of almost 15,000 data points and 10 variables. Unfortunately there is no codebook available besides brief information on the available inputs. The inputs we will be using are:

* Employee satisfaction level
* Last evaluation
* Number of projects
* Average monthly hours
* Time spent at the company
* Whether they have had a work accident
* Whether they have had a promotion in the last 5 years
* Sales [describing the department]
* Salary

In addition, the dataset provides an indicator of whether an employee has left. This indicator will be the dependent variable that we try to predict.


## Solution Statement
First we'll do an in-depth exploratory analysis in an attempt to explain the underlying relationships in the data set. Afterwards we'll approach this problem on two dimensions by combining unsupervised and supervised learning algorithms:

1. We'll apply two clustering algorithms, namely [K-Means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and [Gaussian Mixture Model](http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture), in an attempt to add more semantics to the employee set. Since we are operating in a sociological domain, the more we can derive from our feature set about human interaction and perception, the better. By clustering, we might be able to hand our supervised algorithm additional information for a better prediction down the road. In addition, we might consider engineering further features from the existing feature set, depending on the questions this step raises.
2. Afterwards we'll use our features and the additional cluster information to predict whether an employee left or stayed. Given that we're dealing with labeled data, and considering the number of data points, we're inclined to start with a [Linear Support Vector Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC) or [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and then work our way to a [KNeighbors Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) implementation. After that we'll apply more sophisticated algorithms such as a [Support Vector Classifier with polynomial or RBF kernel](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) or ensemble methods like [Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) or [eXtreme Gradient Boosting](https://pypi.python.org/pypi/xgboost).
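The two-step idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the final implementation: the data is synthetic (generated with `make_classification` as a stand-in for the HR data), the choice of three clusters is arbitrary, and `add_cluster_features` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HR data (assumption: 9 numeric features, binary target).
X, y = make_classification(n_samples=1000, n_features=9, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale first, since K-Means and GMM are distance/density based.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Step 1: fit both clustering models on the training data only (3 clusters is an arbitrary choice).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train_s)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_train_s)


def add_cluster_features(X_scaled):
    """Append K-Means centroid distances and GMM membership probabilities as extra columns."""
    km_distances = kmeans.transform(X_scaled)   # shape (n_samples, 3)
    gmm_proba = gmm.predict_proba(X_scaled)     # shape (n_samples, 3)
    return np.hstack([X_scaled, km_distances, gmm_proba])


# Step 2: train a simple classifier on the original plus cluster-derived features.
X_train_aug = add_cluster_features(X_train_s)
X_test_aug = add_cluster_features(X_test_s)
clf = LogisticRegression(max_iter=1000).fit(X_train_aug, y_train)
test_accuracy = clf.score(X_test_aug, y_test)
print(f"Hold-out accuracy with cluster features: {test_accuracy:.3f}")
```

Fitting the clustering models on the training split only keeps information from the hold-out set out of the engineered features.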

In addition we're trying to implement a model that will empower employers not only to detect which of their employees are about to leave the company but also to determine which features contribute the most to a happy company culture. The success of our final solution, however, will be measured by its predictive accuracy on a held-out test set and the performance of our supervised learning approach.

## Benchmark Model
A lot of research and work has been performed to retain talent and decrease employee turnover rates, ranging from corporate-funded research and analytics services [2] to management study classics [3] and more recent research [1], to name a few.

There are valid methods for predicting employee turnover, and their results can be taken as benchmark models. In this research we'll be using the machine learning approach of Punnoose and Ajit [1] as a benchmark. Their research performed predictive tasks on company information and reported an AUC (Area Under the Curve) score of .86 on hold-out data as its best result, using an XGBoost model.

## Evaluation Metrics
AUC (**A**rea **u**nder the **C**urve) is a common evaluation metric for binary classification problems. Its value lies between 0 and 1 and summarizes how well a binary classifier separates the two classes across all decision thresholds. As described [here](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve), the area under the curve is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

To put this more into perspective, [Kaggle](https://www.kaggle.com/wiki/AreaUnderCurve) describes a way of viewing AUC as a plot of the true positive rate vs. the false positive rate as the threshold value for classifying an item as 0 or 1 is increased from 0 to 1. If the classifier is very good, the true positive rate will increase quickly and the area under the curve will be close to 1. If the classifier is no better than random guessing, the true positive rate will increase linearly with the false positive rate and the area under the curve will be around 0.5.
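The ranking interpretation of AUC can be checked by hand on a few scores. The sketch below is a plain-Python illustration, not the implementation we will use for evaluation; `pairwise_auc` is a hypothetical helper name and the labels and scores are made up.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = employee left, 0 = employee stayed.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Three of the four (positive, negative) pairs are ordered correctly.
auc = pairwise_auc(y_true, scores)
print(auc)  # 0.75
```

For real evaluation, scikit-learn's `roc_auc_score` computes the same quantity far more efficiently; the quadratic loop here is only meant to make the definition concrete.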

## Project Design
After describing the problem, a possible solution and our dataset, we would like to spend a few moments explaining the overall project design. In general the approach will be split into the following chronological steps:

### Exploratory Analysis
To get a better grasp of the information included in the data set, we'll start off with a thorough exploratory analysis. After looking at the key metrics of the data we'll start asking basic questions regarding the status quo of our employees. We'll try to figure out which features tend to be more descriptive than others when it comes to our prediction tasks, and we'll therefore be looking at correlations to discover underlying relationships in our data.

### Data Preprocessing
From a short preview of the data set we can already tell that we are dealing with numeric as well as categorical data. In order to make this information machine-readable we'll be applying feature transformation techniques such as one-hot encoding. For numeric attributes we might consider normalization or rescaling. In addition, depending on our exploratory analysis, we might consider imputation techniques for missing data as well as outlier detection and processing.
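A minimal sketch of such a preprocessing step, assuming scikit-learn's `ColumnTransformer` is used. The four-row data frame and its column names are hypothetical placeholders, not the actual column names of the Kaggle file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample; column names are assumptions for illustration only.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "average_monthly_hours": [157, 262, 272, 223],
    "department": ["sales", "technical", "sales", "support"],
    "salary": ["low", "medium", "medium", "high"],
})

numeric_cols = ["satisfaction_level", "average_monthly_hours"]
categorical_cols = ["department", "salary"]

preprocessor = ColumnTransformer(
    transformers=[
        # Standardize numeric columns to zero mean and unit variance.
        ("num", StandardScaler(), numeric_cols),
        # One-hot encode categorical columns; unseen categories become all-zero rows.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)  # 2 scaled numeric columns + 3 department + 3 salary indicator columns
```

Since salary levels are ordered, an ordinal encoding would be an alternative worth comparing during the exploratory phase.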

### Deciding on algorithms & techniques
Even though we have already established the core principle of the research as a supervised learning task, we'll be conducting additional research to find the best-fitting algorithms for our intended solution. The list of possible algorithms is long, and we'll even be trying some more exotic ones in order to get to the best possible solution.
There are many ways to tackle this problem, but our key criteria for narrowing down the list of algorithms will be data size, computational efficiency and cross-validation scores on the training set.

To come closer to a decision about which algorithms to use, let's take a look at the prerequisites of our setup. As a rule of thumb, applying a machine learning technique requires a data set with more than 50 data points, a condition our roughly 15,000 entries easily satisfy. Since we are dealing with labeled data, we'll be looking at supervised learning techniques.
As mentioned earlier, there are already a few contenders that we can pick from, but first we need to answer some more questions to pick the right one. While some models work better on large sets, others are better suited to a small set of training data. A commonly cited threshold between small and large data sets is around 100,000 data points. Since the amount of data for our problem lies well below this threshold, we are more inclined to work with models that perform well on a small set size, and we can assume that time won't be much of an issue since we're running our calculations offline.

As a general note: if the training set is small, classifiers with a high bias and a low variance, such as Naive Bayes, have an advantage over classifiers with a low bias but a high variance (which are more likely to overfit), such as k-NN.

To narrow down the list of first contenders even more, let's take a look at the characteristics of some implementations we've already touched on.

#### Support Vector Machines
In the domain of classification, with a set below 100,000 data points, one common model to use is the Support Vector Machine (SVM). SVMs tend to have a high accuracy while maintaining a fairly low variance (and are hence less likely to overfit). In complicated domains where there is a clear margin that separates the classes, this model can work really well. Even for cases where the data is not linearly separable the model can perform well, and the ability to separate the classes in a higher-dimensional space (by using a polynomial kernel) makes it quite versatile. There are some drawbacks though. If our data turns out to contain a lot of noise, or grows too large as we add observations, the model tends to perform poorly or train very slowly. Because of the current amount of data, the clear structure of the data set and the model's versatility, it will most likely be one of our contenders.
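To illustrate why kernels matter when the data is not linearly separable, here is a small sketch on a synthetic two-ring data set (`make_circles`). This is an assumption-laden toy example: the HR data may look nothing like it, and the hyperparameters are defaults chosen only for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Two concentric rings: no straight line can separate the classes.
X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "linear": make_pipeline(StandardScaler(), LinearSVC(max_iter=10000)),
    "poly (degree 2)": make_pipeline(StandardScaler(), SVC(kernel="poly", degree=2, coef0=1)),
    "rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Fit each model and record its hold-out accuracy.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:>16}: {score:.3f}")
```

On data like this the kernelized models should clearly outperform the linear one, which is the versatility referred to above.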

#### Logistic Regression
The next option we'd like to consider is Logistic Regression (LR). Although the name might imply that we're dealing with a regression model, that's actually not the case. LR is used for classification problems such as ours and usually tends to be a good starting solution. Its advantage over SVMs is that LR tends to work better on large data sets. This might not play a key role now, but it's worth mentioning. There are additional benefits to this model that might come in handy. With LR there are many ways to regularize the model, and we could use grid search to tune it to our needs. Additionally, if we find out that high correlations are hidden in our data set, LR usually fends off these effects quite nicely. Another of its main advantages is that new data can easily be added and the model can be updated in an ongoing process. A downside of LR is that it might have difficulties with binary features, which might be a handicap if we are using a lot of one-hot encoded features.
Since LR is used in credit risk analysis, a domain that tries to understand human behaviour, this model might work well for our purposes.
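A sketch of how the regularization strength of LR could be tuned with grid search. The data is synthetic and the grid of `C` values is an arbitrary assumption; in scikit-learn, smaller `C` means stronger regularization.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HR data (assumption).
X, y = make_classification(n_samples=800, n_features=9, n_informative=5, random_state=0)

# Scaling inside the pipeline keeps each cross-validation fold free of leakage.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Candidate values for the inverse regularization strength C (arbitrary grid).
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

best_C = search.best_params_["logisticregression__C"]
best_auc = search.best_score_
print(f"best C = {best_C}, mean cross-validated AUC = {best_auc:.3f}")
```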

#### Random Forest
Another contender could be a Decision Tree model. Because of their clear structure, Decision Trees are fairly easy to explain and interpret. A decision tree can be thought of as a sequence of questions about the data, asked with the goal of maximizing information gain. Essentially we're handing the model a labeled training data set which it uses to build paths that lead to one of the classes. In our case a simplified path could be portrayed as follows:

![DT Model](img/DT_Model.png)

The model will ask more informative questions earlier than less informative ones and work its way down the list of features to come to a conclusion. Besides their clear structure, the main advantage of Decision Trees is their robustness against outliers. We could also prevent overfitting by applying pruning, which essentially limits the tree to a certain depth.
However, we might reach for a more advanced implementation of trees. Based on the paradigm of divide-and-conquer, the performance of Decision Tree models can be significantly enhanced by using a collective of trees. The main principle is that many 'weak learners' can form a 'strong learner' if they combine forces while tackling the task from different angles. One ensemble method that incorporates trees is Random Forest (RF). An RF fits a number of decision tree classifiers on sub-samples of the data set and of the feature set and averages their predictions to improve predictive accuracy and avoid overfitting. RF could prove itself through its ability to handle non-linear relationships. Unlike LR, it can handle categorical features very well. It's also well suited to high-dimensional spaces (in case we want to add or engineer more features) and large numbers of training examples. A major downside of RF models, though, is their lack of sensitivity towards correlated features: when features are correlated, strong features might end up with low importance scores.
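The following sketch contrasts a single depth-limited tree with a Random Forest and reads out the forest's feature importance scores. The data is synthetic and all hyperparameters (`max_depth=4`, `n_estimators=200`) are assumptions for illustration, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR data (assumption).
X, y = make_classification(n_samples=1000, n_features=9, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# A single tree, limited in depth to reduce overfitting.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# A forest: each tree sees a bootstrap sample and a random feature subset per split.
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", random_state=0
).fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(f"single tree: {tree_acc:.3f}, random forest: {forest_acc:.3f}")

# Impurity-based importances sum to 1; correlated features share (and dilute) their score.
importances = forest.feature_importances_
ranking = importances.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Because correlated features split importance between them, these scores are best read alongside the correlation analysis from the exploratory step.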

In order to make sure we're avoiding overfitting we'll be implementing cross-validation.

### Defining Models
After deciding on three or four supervised algorithms from the set of [Linear SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC), [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), [KNeighbors Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier), [polynomial SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC), [Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) and [eXtreme Gradient Boosting](https://pypi.python.org/pypi/xgboost), we'll apply numerous optimization tasks to figure out the best way of handling our prediction. This might include selecting, fine-tuning and combining the best algorithms using techniques such as model fitting, model blending, data reduction and feature selection, and assessing the yield of each model over the baseline. To avoid overfitting and enable generalization we will be using cross-validation. Once we have narrowed down the list of possible candidates, we'll apply grid search to find the best parameters for each algorithm given the task.
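The narrowing-down step could look roughly like the sketch below, which compares a few candidates by cross-validated AUC before any fine-tuning. The data is synthetic, the candidate list is only a subset of the algorithms named above, and keeping the top two is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training portion of the HR data (assumption).
X, y = make_classification(n_samples=800, n_features=9, n_informative=5, random_state=0)

# Stratified folds keep the class ratio stable in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean AUC over the folds for each candidate.
cv_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# Keep the two best candidates for subsequent grid search (arbitrary cut-off).
shortlist = sorted(cv_auc, key=cv_auc.get, reverse=True)[:2]
print(cv_auc)
print("shortlist:", shortlist)
```

Using AUC as the scoring function keeps the model selection consistent with the evaluation metric and the benchmark.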

### Discussion
Finally we'll discuss our model and our results in comparison to the references and benchmark models mentioned above. This is where we want to emphasize the general idea behind our approach and open it up to additional data sets from small to mid-size companies.

# References

* [1] **Punnoose R, Ajit P** (2016) "[Prediction of Employee Turnover in Organizations using Machine Learning Algorithms](https://thesai.org/Downloads/IJARAI/Volume5No9/Paper_4-Prediction_of_Employee_Turnover_in_Organizations.pdf)" in *International Journal of Advanced Research in Artificial Intelligence, Vol 5.*

* [2] **Weisbeck, D** (2015) "[Fact or Hype: Do Predictive Workforce Analytics Actually Work?](http://www.visier.com/tech-insights/do-predictive-workforce-analytics-actually-work/)" on *visier.com*.

* [3] **Cotton J, Tuttle J** (1986) "[Employee Turnover: A Meta-Analysis and Review With Implications for Research](https://www.researchgate.net/publication/211384381_Employee_Turnover_A_Meta-Analysis_and_Review_With_Implications_for_Research)" in *The Academy of Management Review 11(1)*.