# HR Analytics: Employee Retention

Goals of this mini project:

1. to investigate which factors may have led employees of a company to leave their jobs in the past;
2. to build a model that can accurately predict whether other employees are about to leave the company due to dissatisfaction.

This kind of information can be very useful for any company: among other benefits, it helps spot management deficiencies and avoid expenses related to employee turnover.

For this task, a data set containing information about ~15000 employees is used. Different learning algorithms are applied to build the predictive models, and their results are evaluated in terms of sensitivity, specificity and the area under the Receiver Operating Characteristic (ROC) curve.
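The three evaluation measures can be illustrated on a handful of made-up predictions. The following is a minimal base-R sketch, not the evaluation code of this project: the labels, the probabilities and the 0.5 threshold are all assumptions chosen for illustration, and the AUC is computed with the rank-sum (Mann-Whitney) formulation rather than with a dedicated package.

```r
# Toy labels and predicted probabilities (made up for illustration only).
actual <- factor(c('yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no'),
                 levels = c('no', 'yes'))
prob <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1)

# Classify as "yes" when the predicted probability reaches the threshold.
# The threshold value is an assumption, not taken from the project.
threshold <- 0.5
predicted <- factor(ifelse(prob >= threshold, 'yes', 'no'),
                    levels = c('no', 'yes'))

# Confusion matrix: rows are predictions, columns are true classes.
cm <- table(predicted = predicted, actual = actual)

sensitivity <- cm['yes', 'yes'] / sum(cm[, 'yes'])  # TP / (TP + FN)
specificity <- cm['no', 'no'] / sum(cm[, 'no'])     # TN / (TN + FP)

# AUC via the rank-sum formulation: the probability that a randomly chosen
# positive is ranked above a randomly chosen negative (ties count as 1/2).
ranks <- rank(prob)
n.pos <- sum(actual == 'yes')
n.neg <- sum(actual == 'no')
auc <- (sum(ranks[actual == 'yes']) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)

c(sensitivity = sensitivity, specificity = specificity, auc = auc)
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes the ranking quality over all thresholds, which is why the three are reported together.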

## 1 Data load

In [None]:
# The data set is in CSV format.
data <- read.csv('raw_data.csv', header = TRUE)

In [None]:
# Get the size of the data set.
dim(data)

In [None]:
# Basic description.
str(data)

In [None]:
# First 6 observations (the default for head()).
head(data)

In [None]:
# Last 6 observations (the default for tail()).
tail(data)

In [None]:
# Min, max, mean, median, quartiles and frequencies.
summary(data)

In [None]:
# Count missing values.
sapply(data, function(column) { sum(is.na(column)) })

The data set contains 14999 observations, each described by 12 features:

* `name`: employee's last name.
* `satisfaction_level`: employee's satisfaction, as a 0-1 score.
* `last_evaluation`: employee's score in the last evaluation, as a 0-1 score.
* `number_projects`: number of projects the employee has worked on.
* `average_monthly_hours`: average working hours per month.
* `time_spent_company`: number of years the employee has worked at the company.
* `work_accident`: 1 if the employee has had any work accident, and 0 otherwise.
* `left`: 1 if the employee has left the company, and 0 otherwise.
* `promotion_last_5_years`: 1 if the employee has received a promotion in the last 5 years, and 0 otherwise.
* `department`: department the employee works in.
* `salary`: employee's salary level ("low", "medium", "high").
* `salary_level`: employee's salary level as a number (1, 2, 3).
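Since `salary_level` is described as the numeric form of `salary`, the two columns should carry the same information, and one of them can be dropped before modelling. A minimal sketch of such a check follows. It uses a toy data frame rather than the real data, and it assumes that 1, 2 and 3 correspond to "low", "medium" and "high" in that order, which the description above suggests but does not state explicitly.

```r
# Toy rows standing in for the real data (values are illustrative only).
toy <- data.frame(
  salary = c('low', 'high', 'medium', 'low'),
  salary_level = c(1, 3, 2, 1)
)

# Assumed mapping: low = 1, medium = 2, high = 3.
salary.codes <- as.integer(factor(toy$salary,
                                  levels = c('low', 'medium', 'high')))

# TRUE when both columns encode the same information.
redundant <- all(salary.codes == toy$salary_level)
redundant
```

The same comparison can be run on the real columns by replacing `toy` with `data`.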

There are no missing values to be filled.

`left` is the target variable.

In [None]:
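# Represent nominal features as factors.
# Explicit levels make the 0/1 -> no/yes mapping independent of the data.
data$work_accident <- factor(data$work_accident, levels = c(0, 1), labels = c('no', 'yes'))
data$left <- factor(data$left, levels = c(0, 1), labels = c('no', 'yes'))
data$promotion_last_5_years <- factor(data$promotion_last_5_years, levels = c(0, 1), labels = c('no', 'yes'))
# read.csv() keeps strings as characters in R >= 4.0, so convert explicitly.
data$department <- factor(data$department)
# Salary is ordinal: low < medium < high.
data$salary <- factor(data$salary, levels = c('low', 'medium', 'high'), ordered = TRUE)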

## 2 Exploratory data analysis

In [None]:
library(ggplot2)
library(dplyr)

In [None]:
# Feature names.
names.nominal <- c('work_accident', 'left', 'promotion_last_5_years',
                   'department', 'salary')
names.numerical <- c('satisfaction_level', 'last_evaluation', 'number_projects',
                     'average_monthly_hours', 'time_spent_company')

### Feature distributions

In [None]:
options(repr.plot.width = 7, repr.plot.height = 2)
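
The plotting code for this subsection is not shown here. As a minimal sketch of how one feature distribution could be drawn with `ggplot2`, the example below plots a histogram of a numerical feature split by the target variable. The `toy` data frame is simulated and only borrows the column names; the bin width and transparency are arbitrary choices.

```r
library(ggplot2)

# Toy data standing in for the real data set (values are simulated).
set.seed(1)
toy <- data.frame(
  satisfaction_level = runif(200),
  left = factor(sample(c('no', 'yes'), 200, replace = TRUE))
)

# Histogram of one numerical feature, with overlapping bars per target class.
p <- ggplot(toy, aes(x = satisfaction_level, fill = left)) +
  geom_histogram(binwidth = 0.05, position = 'identity', alpha = 0.5) +
  labs(x = 'satisfaction_level', y = 'count')
print(p)
```

Looping the same call over the names in `names.numerical` would give one plot per numerical feature.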

### Feature correlations

In [None]:
options(repr.plot.width = 7, repr.plot.height = 8)
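
The code for this subsection is not shown here either. A simple way to inspect linear relationships between numerical features is a Pearson correlation matrix from base R's `cor()`. The sketch below runs on simulated data that only borrows some column names; the dependency between `number_projects` and `average_monthly_hours` is built in by construction so that the matrix has a non-trivial entry, and says nothing about the real data.

```r
# Toy numerical features (simulated; not the real data).
set.seed(1)
toy <- data.frame(
  number_projects = sample(2:7, 100, replace = TRUE),
  last_evaluation = runif(100)
)
# Artificial dependency, added only to make the example informative.
toy$average_monthly_hours <- 100 + 30 * toy$number_projects + rnorm(100, sd = 20)

# Pairwise Pearson correlations between the numerical features.
corr <- cor(toy, method = 'pearson')
round(corr, 2)
```

On the real data, the equivalent call would be `cor(data[, names.numerical])`.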

## 3 Preprocessing

## 4 Model training

## 5 Results