# Need of the day: People Analytics
**Analyze your own employees to identify voluntary attrition**
***
**"Clients do not come first. Employees come first. If you take care of employees, they will take care of the clients" - Richard Branson**

**"I quit..."** No employer wants to hear these words, but more and more employers are hearing them as employees leave for greener pastures. In today’s world, employers have to make enormous efforts to retain their highly skilled employees. According to the latest statistics from the Bureau of Labor Statistics [2], voluntary quits rose by 82,000 in professional and business services. Hence the million-dollar question for an employer: how can one gauge the signs of flight risk long before a high performer starts looking for a new position? Is there a systematic pattern or trend that can be analyzed to preempt quits? Can we turn to data to predict which employees will leave and why? If these questions can be answered using data, the employer can offer incentives to the employees at risk of leaving. It's a win-win situation for both employer and employee.

### Motivation for this project

Even though companies have perennially faced the problem of attrition, using data to analyze employees and predict attrition is fairly new. Even tech companies with huge data infrastructure have been slow to apply data analytics to their own employees. I became interested in this problem while doing the literature survey for my capstone project. I found countless articles on LinkedIn about analyzing and predicting employee turnover, and most of the literature on People Analytics was written in the last 5 years. I found this very surprising, as business intelligence has been used to solve various business problems for decades [3]. I wanted to create a quick prototype of a statistical model to predict attrition. Since employee data can be huge for big organizations, and the number of features required to predict attrition may vary from domain to domain, I was looking for a simpler and smaller dataset to start with. The Kaggle competition [1] with its simulated dataset [9] fits these requirements well.

**Human-centered design considerations which inform my decision to pursue this project**

Data science is a revolution. The more we can solve human-centered problems using data, and the more we apply scientific methods to analyzing and predicting human data, the better it is for humanity. As the quote attributed to Yogi Berra goes: “It's tough to make predictions, especially about the future.” One wonders, then, how difficult it is to make predictions about people. This project can provide intriguing insights into what people value most in their job: whether it's money, the number of hours they put in, how well they are evaluated, or the number of projects they work on in a year. Once these insights are gained, results from People Analytics can inform managerial decisions such as which factors to address to improve employee satisfaction, how to strengthen the relationship between leadership and staff, and how much budget to allocate to morale events at the organization.

## Defining the Research Questions
***
We aim to answer these questions by applying data science to employee data.
 * Question 1: Which employees are most likely to leave voluntarily?
 * Question 2: What are the reasons for the employees to leave?
 * Question 3: How can these employees be retained?

## Hypothesis
***
 * Hypothesis 1: High performers (good last_evaluation) and low performers (poor last_evaluation) are the most likely to leave.
 * Hypothesis 2: Even though 'satisfaction_level' seems to be a giveaway for an employee's intent to leave, there are other predictors that are just as strongly correlated with attrition.


## 1. Dataset
***
### 1a. Data Source and License information
The data is taken from the Kaggle competition [Human Resources Analytics](https://www.kaggle.com/ludobenistant/hr-analytics/data). This dataset is released under CC BY-SA 4.0; the details of this license can be found [here](https://creativecommons.org/licenses/by-sa/4.0/). There are 15000 observations in the dataset with 10 features.

### 1b. Get a glimpse of data

```python
import pandas as pd

# DataFrame.from_csv is no longer available in current pandas; read_csv is the replacement.
data = pd.read_csv('HR_comma_sep.csv')
data.head()
```

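| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | sales | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.8 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |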


### 1c. Data Cleaning

As shown above, the data is fairly neat. Although no heavy processing is needed, I plan to change a few column names that are misleading. For example, the 'sales' column actually holds the employee's department name. Similarly, the 'left' column indicates whether the employee has left. Since 'left' will be the target variable for the attrition model, I will rename it to 'attrition'. I also plan to check for missing values and decide how to fill them, and to check whether any column needs typecasting.
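The cleaning steps described above could look roughly like the sketch below. It re-types the five rows shown in the preview so that it runs without the CSV file. The new column name 'department' and the ordered salary levels (including a 'high' level that does not appear in the preview) are my assumptions, not decisions already made in this plan.

```python
import pandas as pd

# The five preview rows, re-typed so the sketch runs without the CSV file.
data = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37],
    "last_evaluation": [0.53, 0.86, 0.88, 0.87, 0.52],
    "number_project": [2, 5, 7, 5, 2],
    "average_montly_hours": [157, 262, 272, 223, 159],
    "time_spend_company": [3, 6, 4, 5, 3],
    "Work_accident": [0, 0, 0, 0, 0],
    "left": [1, 1, 1, 1, 1],
    "promotion_last_5years": [0, 0, 0, 0, 0],
    "sales": ["sales"] * 5,
    "salary": ["low", "medium", "medium", "low", "low"],
})

# Rename the misleading columns ('department' is an assumed name).
data = data.rename(columns={"sales": "department", "left": "attrition"})

# Count missing values per column and record the column types.
missing_counts = data.isnull().sum()
column_types = data.dtypes

# Salary is ordinal, so an ordered categorical is one possible typecast.
# The 'high' level is an assumption; it is not visible in the preview rows.
data["salary"] = pd.Categorical(
    data["salary"], categories=["low", "medium", "high"], ordered=True
)

print(missing_counts)
print(column_types)
```

If `missing_counts` turns out to be non-zero on the full dataset, the fill strategy (median, mode, or dropping rows) would still need to be chosen per column.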

## 2. Statistical Analysis
***
Once the data is cleaned up and ready to use, I will apply several analytical methods to get a sense of it. Once the data is summarized, I will perform correlation analysis to identify the features that will be fed to the classification model.


### 2a. Descriptive Statistics
The first step is to summarize the data and find the distribution of each feature in the dataset. For this purpose I will compute descriptive statistics: the mean, median, standard deviation, min, max, and 25th and 75th percentiles of the columns with quantitative values.
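As a minimal sketch of this step, pandas' `describe()` returns all of the listed statistics in one call (the median is reported as the `50%` row). The frame below holds only the numeric columns of the five preview rows, so the numbers are illustrative rather than results for the full dataset.

```python
import pandas as pd

# Numeric columns of the five preview rows; not the full dataset.
data = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37],
    "last_evaluation": [0.53, 0.86, 0.88, 0.87, 0.52],
    "number_project": [2, 5, 7, 5, 2],
    "average_montly_hours": [157, 262, 272, 223, 159],
    "time_spend_company": [3, 6, 4, 5, 3],
})

# Rows of the result: count, mean, std, min, 25%, 50% (median), 75%, max.
summary = data.describe()
print(summary)
```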

### 2b. Correlation Analysis
Secondly, I will look for the features that are most correlated with attrition by running a correlation analysis across all features. I will create a correlation matrix and a correlation heatmap to find which features affect our target variable (attrition) the most. The analysis will also show which features are positively correlated with attrition and which are negatively correlated.
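A rough sketch of the correlation matrix is shown below. The first four rows come from the preview; the last four rows (employees who stayed) are hypothetical values I made up so that the target column varies, so the resulting numbers say nothing about the real data.

```python
import pandas as pd

# First four rows are from the preview; the last four are hypothetical "stayers".
data = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.90, 0.65, 0.85, 0.70],
    "number_project": [2, 5, 7, 5, 3, 4, 4, 3],
    "average_montly_hours": [157, 262, 272, 223, 180, 200, 190, 210],
    "attrition": [1, 1, 1, 1, 0, 0, 0, 0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = data.corr()

# Correlation of each feature with the target, from most negative to most positive.
attrition_corr = corr["attrition"].drop("attrition").sort_values()
print(attrition_corr)

# A heatmap could then be drawn with, for example, seaborn.heatmap(corr, annot=True).
```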

## 3. Feature Engineering
***
I plan to extract the top 5 features based on the correlation analysis. The 5 features most strongly correlated with attrition, whether positively or negatively, will be used in the logistic regression model.
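One way to sketch this selection is to rank features by the absolute value of their correlation with the target. The helper name `top_correlated_features` and the toy data are hypothetical; the example picks the top 2 of 3 numeric features only because the toy frame is small.

```python
import pandas as pd

def top_correlated_features(df, target, k=5):
    """Return the k numeric features with the largest absolute correlation to target."""
    numeric = df.select_dtypes(include="number")
    correlations = numeric.corr()[target].drop(target)
    ranked = correlations.abs().sort_values(ascending=False)
    return ranked.head(k).index.tolist()

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "satisfaction_level": [0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 0.9, 0.6],
    "number_project": [2, 5, 7, 5, 3, 4, 4, 3],
    "Work_accident": [0, 1, 0, 1, 0, 1, 0, 1],
    "department": ["sales"] * 8,  # non-numeric columns are ignored
    "attrition": [1, 1, 1, 1, 0, 0, 0, 0],
})

selected = top_correlated_features(toy, target="attrition", k=2)
print(selected)
```

Ranking by absolute value keeps strongly negative predictors, which a plain descending sort on the signed correlation would push to the bottom.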

## 4. Training the model
***
I will use logistic regression to classify employees on the basis of their probability of leaving. Given our research questions, this is a suitable model for a couple of reasons. First, it is a very simple model, and Python provides a ready-to-use implementation (sklearn.linear_model.LogisticRegression) [10]. Second, we not only need the classification itself but also want probabilities, so that employees can be grouped into bands based on their risk of leaving; these probabilities are available from the LogisticRegression method 'predict_proba(X)'. To assess the quality of the model, the available data will be divided into appropriately sized training and test sets, and these splits will be varied randomly to ensure that the resulting models strike a good trade-off between bias and variance.
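The training and evaluation plan above might be sketched as follows. The data is synthetic and generated from a rule I invented (low satisfaction and long hours raise the chance of leaving); the split size, `max_iter`, and the use of 5-fold cross-validation as the way to vary the splits are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the employee data (invented rule, illustration only).
rng = np.random.default_rng(0)
n = 400
satisfaction = rng.uniform(0, 1, n)
monthly_hours = rng.uniform(100, 300, n)
score = 3.0 - 6.0 * satisfaction + 0.01 * (monthly_hours - 200)
attrition = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-score))).astype(int)

X = np.column_stack([satisfaction, monthly_hours])
y = attrition

# Hold out 25% of the rows for testing, keeping the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the probability of class 1 (the employee leaves).
leave_probability = model.predict_proba(X_test)[:, 1]
test_accuracy = model.score(X_test, y_test)

# Repeating the fit over several different splits gives a sense of variance.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(test_accuracy)
print(cv_scores.mean(), cv_scores.std())
```

A large spread in `cv_scores` would suggest the model is sensitive to which rows land in the training set.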

### 4a. Classifying employees using Logistic Regression
Logistic regression estimates how likely an observation is to belong to each group, and it is commonly used to predict the likelihood of an event occurring. In contrast to linear regression, the linear output is passed through the logistic (sigmoid) function, the inverse of the logit, which maps it to a probability between 0 and 1. A threshold (typically 0.5) then turns that probability into a class label of 0 or 1. This makes it a useful model for this problem, because we are interested in predicting whether an employee will leave (1) or stay (0), matching the coding of the 'left' column.
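To make the mapping concrete, here is a small sketch of the logistic function and a thresholded decision. It follows the dataset's coding, where 1 means the employee left; the 0.5 threshold and the function names are assumptions for illustration.

```python
import math

def logistic(z):
    """Map a real-valued linear score z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Return 1 (leave) when the probability reaches the threshold, else 0 (stay)."""
    return 1 if logistic(z) >= threshold else 0

print(logistic(0.0))   # 0.5
print(classify(2.0))   # 1
print(classify(-2.0))  # 0
```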

## 5. Retention Plan Using Logistic Regression
***
The probabilities produced by logistic regression can also be used directly. Once we have a probability of leaving for each employee, we can place employees in different bands:

1.	**Extremely likely**: Employees with the highest probability of leaving (p > 90%).
2.	**Moderately likely**: Employees with a medium probability of leaving (50% ≤ p ≤ 90%).
3.	**Least likely**: Employees with a low probability of leaving (p < 50%).

Having identified the employees at high risk of leaving, the employer can take the necessary actions to retain them.
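The banding above can be sketched as a small function. The function name is hypothetical, and assigning probabilities of exactly 50% and 90% to the middle band is my assumption.

```python
def risk_band(probability):
    """Place a probability of leaving into one of the three retention bands."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability > 0.9:
        return "Extremely likely"
    if probability >= 0.5:  # assumption: 0.5 and 0.9 fall in the middle band
        return "Moderately likely"
    return "Least likely"

for p in (0.95, 0.70, 0.20):
    print(p, risk_band(p))
```

In practice the input would be the per-employee probabilities returned by `predict_proba`, and the cut-offs could be tuned to how many employees the retention budget can cover.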

## 6. Limitations
* One of the biggest limitations of this dataset is that it does not contain data from multiple points in time. With a time element in the data, we could have estimated the probability of employee attrition over time. Survival analysis [8] would have been a good statistical technique for this purpose.
* This is a simulated dataset with a very limited number of features. Thus, this model will act more as a prototype for building more complex models on real-life data.

## References
***
* [1] Kaggle project https://www.kaggle.com/ludobenistant/hr-analytics
* [2] Latest report from the Bureau of Labor Statistics https://www.bls.gov/news.release/jolts.nr0.htm
* [3] History of business intelligence https://www.betterbuys.com/bi/history-of-business-intelligence/
* [4] Article on solving attrition with data https://towardsdatascience.com/solving-staff-attrition-with-data-3f09af2694cd
* [5] Article on descriptive statistics https://www.marsja.se/pandas-python-descriptive-statistics/
* [6] Helpful LinkedIn article on predicting employee attrition https://www.linkedin.com/pulse/predicting-employee-attrition-who-quit-when-praful-tickoo/
* [7] Helpful LinkedIn article on predicting employee attrition https://www.linkedin.com/pulse/analyzing-employee-turnover-predictive-methods-richard-rosenow-pmp/
* [8] Use of survival analysis for predicting attrition https://www.slideshare.net/twbriggs/survival-analysis-for-predicting-employee-turnover/24
* [9] Kaggle dataset https://www.kaggle.com/ludobenistant/hr-analytics/data
* [10] sklearn logistic regression http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


## Attributions
* Link to the license for data https://creativecommons.org/licenses/by-sa/4.0/